
Analyzing Text

Probabilistic Models

Natural Language Processing (NLP) enables computers to interact with human languages, powering tools like chatbots and translation services. The core mechanism behind many NLP tasks is the Probabilistic Language Model, which assigns a probability score \(P(w_1, w_2, ..., w_k)\) to a sequence of words.

Why Probability Matters

We need to assign probabilities to sentences to rank competing alternatives in uncertain situations:

  • Machine Translation: A model determines that \(P(\text{"High winds"}) > P(\text{"Large winds"})\) to select the most natural translation.
  • Spell Checking: It distinguishes between homophones by context, knowing that \(P(\text{"great city"}) > P(\text{"grate city"})\).
  • Speech Recognition: It resolves similar sounds, preferring "I saw a van" over "Eyes awe of an".

How LLMs Work

At their core, Large Language Models (LLMs) operate as iterative probabilistic generators. The process follows a simple loop:

  1. Context Analysis: The model looks at the sequence of words generated so far (\(w_1, w_2, ... w_{k-1}\)).
  2. Probability Distribution: It calculates a probability distribution for what the next word (\(w_k\)) should be.
  3. Sampling: It selects (samples) a specific word from that distribution based on its likelihood.
  4. Iteration: This process repeats one word at a time until the model generates a special "stop" token.

Note: while this loop is simple, the "tricky part" is accurately generating the probability distribution and choosing the right sampling technique.
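
A minimal sketch of this loop in Python (the `model` function is hypothetical; it stands in for whatever produces the next-word distribution):

    import random

    def generate(model, prompt_tokens, stop_token="<stop>", max_len=100):
        # model(context) -> {token: probability} for the next position (hypothetical)
        context = list(prompt_tokens)
        for _ in range(max_len):
            dist = model(context)                     # steps 1-2: distribution over next word
            tokens, probs = zip(*dist.items())
            next_tok = random.choices(tokens, weights=probs, k=1)[0]  # step 3: sampling
            if next_tok == stop_token:                # step 4: stop token ends the loop
                break
            context.append(next_tok)
        return context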

The Chain Rule and Intractability

To calculate the probability of a full sentence, we apply the Chain Rule of Probability, which states that the probability of a sequence is the product of the conditional probabilities of each word given its entire history:

\[ P(w_1...w_k) = P(w_1) \times P(w_2 | w_1) \times P(w_3 | w_1, w_2) \times ... \times P(w_k | w_1...w_{k-1}) \]

The Problem of Intractability

While mathematically sound, this approach is computationally impossible (intractable) for real-world language.

  • Unbounded History: Sentence length is unbounded, so the conditioning history \(w_1...w_{k-1}\) can grow arbitrarily long.
  • Data Sparsity: Even assuming a limit of 20 words and a vocabulary of 100,000 words, there are \(100,000^{20} = 10^{100}\) possible sentences. We could never collect enough text data to estimate the probabilities for every unique 20-word combination.

Large Language Models (LLMs) operate as iterative prediction engines that generate text one word at a time by calculating the probability of the next token based on the preceding sequence. However, because accounting for an infinite history of words is mathematically intractable (impossible due to the sheer number of combinations), LLMs do not actually look at the entire history. Instead, they rely on a finite context window (effectively functioning as massive N-Gram models), where they only analyze a fixed number of recent tokens to approximate the probability distribution for the next word.

The N-Gram Solution (Markov Assumption)

To solve the intractability problem, we simplify the model using the Markov Assumption. We assume that the probability of a word depends only on the previous \(N-1\) words, rather than the entire history.

Model Types

  • Unigram (\(N=1\)): Assumes total independence. The probability of a sentence is just the product of the individual word frequencies.
\[ P(w_1, w_2, ...) \approx P(w_1) \times P(w_2) \times ... \]
  • Bigram (\(N=2\)): Assumes a word depends only on the single previous word.
\[ P(w_k | w_1...w_{k-1}) \approx P(w_k | w_{k-1}) \]
  • Real World Usage: Google Search suggestions use these N-Grams (where N is small) to predict your next keyword.

Calculating Probabilities (Maximum Likelihood Estimation)

We estimate these probabilities by counting occurrences in a training corpus (text data). This is a simple frequency calculation often done using Hadoop/Spark.

Unigram Model (Individual Word Probability)

The probability of a single word \(w\) is simply its frequency relative to the total number of words in the corpus (\(N\)).

\[ P(w) = \frac{C(w)}{N} \]
  • \(C(w)\): The count of times word \(w\) appears.
  • \(N\): The total number of tokens (words) in the corpus.
  • Example: If the word "the" appears 1,000 times in a 10,000-word document, \(P(\text{the}) = 1000/10000 = 0.1\).

Bigram Model (Conditional Probability)

We want to find the probability of word \(w_j\) appearing given that the previous word was \(w_i\). This is written as \(P(w_j | w_i)\).

The Derivation: By the definition of conditional probability:

\[ P(B | A) = \frac{P(A \cap B)}{P(A)} \]

Applying this to our words (\(A = w_i\), \(B = w_j\)):

\[ P(w_j | w_i) = \frac{P(w_i, w_j)}{P(w_i)} \]

Now, substitute the Unigram formula from above:

  • Numerator: \(P(w_i, w_j) = \frac{C(w_i, w_j)}{N}\)
  • Denominator: \(P(w_i) = \frac{C(w_i)}{N}\)
\[ P(w_j | w_i) = \frac{C(w_i, w_j) / N}{C(w_i) / N} \]

The \(N\) terms cancel out, leaving the final simple formula used in Hadoop/Spark:

\[ P(w_j | w_i) = \frac{C(w_i, w_j)}{C(w_i)} \]
  • Logic: To find the probability of "Ham" following "Green", you simply count how many times "Green Ham" appears together and divide it by the total times "Green" appears alone.

Worked Example: "Sam I Am"

Given a corpus of three sentences:

  1. ^ I am Sam $
  2. ^ Sam I am $
  3. ^ I do not like green eggs and ham $

Calculating Initial Probabilities

To determine the probability of a sentence starting with "I" vs "Sam", we look at the start tokens ^.

  • Count(^): 3 (There are 3 sentences).
  • Count(^, I): 2 (Sentences 1 and 3 start with "I").
  • Result: \(P(\text{I} | \^{}) = 2/3\).
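
A quick Python sketch (not from the course materials) that reproduces these counts with plain counters, where ^ and $ are the start/stop tokens:

    from collections import Counter

    corpus = [
        "^ I am Sam $",
        "^ Sam I am $",
        "^ I do not like green eggs and ham $",
    ]

    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    def p_bigram(w_next, w_prev):
        # MLE: P(w_next | w_prev) = C(w_prev, w_next) / C(w_prev)
        return bigrams[(w_prev, w_next)] / unigrams[w_prev]

    print(p_bigram("I", "^"))    # 2/3
    print(p_bigram("am", "I"))   # 2/3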

The Zero Probability Problem (Sparsity)

If we try to calculate the probability of the sentence "I like ham", we multiply the bigrams:

\[ P(\text{I}|\^{}) \times P(\text{like}|\text{I}) \times P(\text{ham}|\text{like}) \times P(\$|\text{ham}) \]
  • However, if the phrase "like ham" never appears in our small training corpus, \(P(\text{ham}|\text{like}) = 0\).
  • This makes the probability of the entire sentence 0, regardless of how common the other parts are.

Handling "Impossible" Sentences: The Zero Probability Problem

In probabilistic models (like N-Grams), a major issue arises when a valid sentence contains a sequence of words that never appeared in the training data.

  • The Problem: If a specific N-gram (e.g., "purple pirate") has a count of 0, the probability \(P(\text{sentence})\) becomes 0 because probabilities in the Chain Rule are multiplied. This suggests the sentence is impossible, even if it is merely rare.
  • The Discontinuity: A single unusual word can instantly crash a sentence's probability from "likely" to "impossible." To fix this, we need to "smooth" the distribution by removing zeros.

The Solution: Laplace Smoothing

Laplace Smoothing (or Add-One Smoothing) solves the zero problem by pretending we have seen every possible event at least once.

  • The Mechanism: We start every count at 1 instead of 0.
    • Example: If we never saw (like, ham), we pretend we saw it once. If we saw (^, I) twice, we count it as 3.
  • The Formula (Joint Probability): To calculate the probability of a bigram pair \((A, B)\) occurring, we adjust the denominator to account for the phantom counts we added.

    \[ P(A, B) = \frac{C(A, B) + 1}{N + V^2} \]
    • \(C(A, B)\): The actual count of the pair.
    • \(N\): Total number of bigrams observed.
    • \(V^2\): We add \(V^2\) because there are \(V\) unique words, meaning there are \(V \times V\) possible unique pairs. Since we added \(+1\) to the numerator for every possible pair, we must add \(V^2\) to the total sum in the denominator to keep probabilities valid.
    • Why \(V^2\) and not \(V(V-1)\)?
    • A common misconception is that we should exclude pairs where a word appears with itself (like (X, X)). However, in language, self-pairs are valid and important.
    • Example: "John had 'had', while Carol had 'had had'." In this sentence, the bigram (had, had) is structurally critical.
    • Because any word can follow any other word (including itself), the full sample space is the cartesian product \(V \times V\), not the permutation \(V(V-1)\).
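
A minimal sketch of the add-one formula, reusing the `bigrams` and `unigrams` counters from the "Sam I Am" sketch above:

    def laplace_joint_prob(pair, bigram_counts, total_bigrams, vocab_size):
        # P(A, B) = (C(A, B) + 1) / (N + V^2)
        return (bigram_counts.get(pair, 0) + 1) / (total_bigrams + vocab_size ** 2)

    N = sum(bigrams.values())   # total bigrams observed
    V = len(unigrams)           # vocabulary size (including ^ and $)
    print(laplace_joint_prob(("like", "ham"), bigrams, N, V))   # small, but no longer zero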

Hidden Markov Models (HMM)

Before deep learning, HMMs were a dominant technique in NLP and Bioinformatics.

  • Use Cases:
    • Phoneme Recognition: Converting a stream of audio sounds into a stream of words.
    • Part-of-Speech (PoS) Tagging: Assigning grammatical context (e.g., noun, verb) to words.
      • Example: In the sentence "Buffalo buffalo Buffalo...", HMMs help determine which "Buffalo" is a city, which is an animal, and which is a verb.
  • Parallelism: HMM training is iterative, making it a good candidate for Spark, though it is not used in the course assignments.

Modern Solutions: Transformers

While N-Grams struggle with context and zeros, modern Large Language Models (LLMs) like GPT use Transformers.

  • Solving the Zero Problem: Transformers are Neural Networks, not tables of counters. They assign latent probabilities to words, meaning no valid sequence ever has a true "zero" probability, so smoothing is not needed.
    • “Latent probabilities” in transformers means the model doesn’t store explicit word counts or lookup tables like traditional n-gram models; instead, it learns continuous internal representations (vectors) and uses neural network layers plus a softmax output to produce a probability distribution over the next token. Because these probabilities are generated from learned weights and smooth mathematical functions, every possible token gets some small non-zero probability unless it is explicitly masked out, even if the exact sequence was never seen in training. In older count-based language models, unseen word combinations would literally have probability 0 and required “smoothing” tricks to avoid impossible predictions; in transformers, the probabilities are implicit (“latent”) in the network’s parameters and naturally spread across the vocabulary, so the zero-probability problem largely disappears without needing manual smoothing methods.
  • Solving the Context Problem: Instead of a rigid N-gram window, they use Self-Attention. This allows the model to look at a massive context (e.g., 4,000+ words) and dynamically decide which specific previous words are important for the current prediction.
    • “Solving the context problem” means that, unlike older n-gram language models which could only look at a small fixed window of the last N words (for example the previous 3–5 tokens), transformers use self-attention to consider a very large portion of the prior text at once—thousands of tokens instead of just a few. Self-attention works by letting each word in the sequence compare itself to every other previous word and assign different importance weights, so the model can dynamically decide which earlier terms actually matter for predicting the next token. Instead of a rigid, sliding window that forgets distant information, the model has a flexible, global view of context and can focus strongly on a keyword from many sentences ago while mostly ignoring irrelevant nearby words, which is why transformers handle long passages and complex dependencies much better than traditional n-gram approaches.

Information Retrieval: Searching

We are shifting topics from probabilistic generation to Information Retrieval (IR)—the science of searching. This is the mechanism behind search engines like Google.

The core challenge in IR is the vocabulary mismatch between the user and the author.

  • The User: Has a concept in their head (e.g., "tragic love story") and translates it into a query.
  • The Document: Contains concepts expressed by an author (e.g., "Shakespeare") using totally different words like "fateful star-crossed romance".
  • The Mismatch: Even though the concepts match perfectly, the words share no overlap. The system must bridge this gap because computers generally rely on exact keyword matching rather than conceptual understanding.

Abstract IR Architecture

[Figure: abstract IR architecture — offline indexing pipeline and online query pipeline]

The architecture of a search engine is divided into two distinct phases: Offline (Preparation) and Online (Live Searching).

1. Offline Phase (The Backend)

  • Document Acquisition: The system crawls the web to gather raw documents.
  • Representation Function: It transforms raw text into a machine-readable format (e.g., stripping HTML, removing punctuation).
    • Embedding: a numerical representation of a document/query.
  • Index: The processed representations are stored in an efficient Index for fast lookup.
    • The organized system that stores and retrieves those embeddings (or tokens) efficiently.

2. Online Phase (The User Interaction)

  • Query Representation: When a user searches, their query goes through the same representation function (normalization) to ensure it matches the format of the index.
  • Comparison Function: The system compares the processed query against the Index to find matches.
  • Hits: The final output is a ranked list of relevant documents.

Representation Matters: "Bag of Words"

Since computers don't truly "understand" language, we must define what "relevant" means mathematically.

  • Bag of Words (BoW): The simplest representation. We treat a document as a loose collection of words, ignoring grammar and order.
  • The Assumptions: This assumes that terms are independent and that the concept of a "word" is well-defined. While these assumptions are technically wrong (context matters!), they are useful approximations for building search engines.

The Complexity of Language

Defining "what is a word" is deceptively difficult outside of English.

  • Multilingual Challenges: Languages like Chinese, Arabic, and Japanese have completely different rules for sentence segmentation and word boundaries compared to English or Russian.

Text Normalization (English)

Even in English, we must standardize text so that queries match documents.

  • Tokenization: The process of splitting text into words (tokens), usually by removing punctuation.
  • Case Folding: Converting everything to a common case (usually lower-case).
    • The Ambiguity: This can lose meaning. "Bush" (the President) becomes "bush" (the plant). "Jack Black" (the actor) becomes "jack black" (a device for lifting cars?).
  • Unicode Normalization: Handling special characters. For example, "é" can be stored as a single character or as "e" + "acute accent." The system must enforce a "canonical form" so that coöp, co-op, and coop are treated consistently.

Case folding + Unicode normalization = make text uniform so visually or stylistically different versions of the same word are treated as the same token.
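
A small illustration of what case folding plus Unicode normalization buys us (a sketch using Python's standard unicodedata module):

    import unicodedata

    # Two ways to type "café": precomposed é vs. "e" + combining acute accent
    a = "caf\u00e9"
    b = "cafe\u0301"
    print(a == b)                                    # False: different code points
    nfc = lambda s: unicodedata.normalize("NFC", s)
    print(nfc(a) == nfc(b))                          # True: same canonical form
    print(nfc("CAFÉ").casefold() == nfc(a))          # True: case folding on top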

Information Retrieval: Indexing & Scaling

Efficient searching requires moving away from simple text scanning (grep) to specialized data structures. This section covers how to build these structures at scale.

The Inverted Index Structure

To search effectively, we must map Content \(\rightarrow\) Documents.

The Matrix vs. The List

  • The Matrix View: Conceptually, we can view the data as a table where rows are terms (e.g., "blue", "cat") and columns are documents.
    • The Problem: This matrix is incredibly sparse (mostly zeros) because most words do not appear in most documents. Storing this directly is wasteful.

[Figure: sparse term–document matrix]

  • The Inverted Index (Postings List): Instead of storing the empty cells, we only store the non-zero entries.
    • Transformation: A sparse row for "fish" like [1, 1, 0, 0] becomes a compact list: fish -> [Doc1, Doc2].
    • Diagram Explanation: The figure visualizes this shift from a large, mostly empty grid (left) to a compact set of pointers (right), illustrating the massive space savings.

[Figure: inverted index — terms mapped to postings lists]

Scaling Assumptions & Laws

When building an index for the web (billions of documents), we encounter two fundamental laws of text data that dictate our engineering constraints.

Heaps' Law: Vocabulary Growth

It is a common misconception that the number of unique words in a language is fixed (e.g., the size of the Oxford Dictionary). Heaps' Law proves otherwise.

\[ M = kT^b \]
  • \(M\): Vocabulary Size (unique words).
  • \(T\): Collection Size (total words processed).
  • \(k, b\): Constants (typically \(k \approx 30-100\), \(b \approx 0.4-0.6\)).
  • Implication: Vocabulary grows unbounded. As we index more documents, we constantly discover new proper nouns, typos, and neologisms. You cannot pre-allocate a fixed-size hash map for terms.
  • Linear in log-log space: plotting \(\log M\) against \(\log T\) gives a straight line with slope \(b\).
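
As a rough sense of scale (assuming \(k = 50\) and \(b = 0.5\), values inside the typical ranges above):

\[ T = 10^9 \text{ tokens} \;\Rightarrow\; M \approx 50 \times (10^9)^{0.5} \approx 1.6 \times 10^6 \text{ unique terms} \]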

Zipf's Law: Term Distribution

Words are not distributed evenly. A tiny fraction of words appear constantly, while the vast majority appear rarely.

\[ f(k; s, N) = \frac{1/k^s}{\sum_{n=1}^{N} (1/n^s)} \]
  • \(N\): This is the total number of unique items (vocabulary size) in the set. For a corpus of text, this is the number of distinct words.
  • \(k\): This is the rank of the word when sorted by frequency.
  • \(s\): This is a parameter that characterizes the distribution (often close to 1 in natural language). It controls how quickly the frequency drops off as the rank increases.
  • Implication:
    • The Head: Common words ("the", "is") have massive postings lists with millions of entries. This creates "stragglers" or hotspots during parallel processing.
    • The Tail: Millions of rare words have tiny postings lists, creating overhead if not managed efficiently.
  • Linear in log-log space: plotting log frequency against log rank gives a straight line with slope \(-s\).

Building the Index with MapReduce

We can parallelize index construction using the MapReduce framework, but we must handle the scaling issues introduced by Zipf's Law.

The Naive Approach

  • Mapper: Tokenizes documents and emits (term, (docid, freq)).

    emit(term, (docid, freq))
    
  • Reducer: Groups by term and creates the list.

    def reduce(term, postings):
        p = list()
        for docid, freq in postings:
            p.append((docid, freq))  # Buffering: the whole postings list is held in memory
        p.sort()                     # Sorting in memory by docID
        emit(term, p)
    

The Scaling Bottleneck (OOM)

The naive approach fails because of Zipf's Law.

  • The Crash: For a word like "the", the Reducer receives billions of records. Trying to append them all into a list (p) to sort them causes an Out of Memory (OOM) error.

The Solution: Secondary Sorting

We leverage the MapReduce framework's internal sorting capabilities to avoid buffering in memory.

  • Key Design: Change the key from term to a composite (term, docID).
  • Partitioner: Ensure all keys with the same term are sent to the same Reducer.
    • Why? If the same term lands on different machines, you have broken the index: you now have two fragmented lists for that term stored in different files on different machines. Searching for the term would require querying every machine and merging the results again, which defeats the purpose of building an index.
  • Execution: The framework automatically sorts the keys. By the time the data hits the Reducer, it is already sorted by docID.
  • Result: The Reducer no longer needs to buffer or sort; it simply streams the data and writes it to disk.

Spark Implementation

This pattern is replicable in Spark using specific transformations:

  1. Run the tokenizer and a per-document word count to get (docID, Map[term, freq]) pairs.
  2. flatMap to generate ((term, docID), freq) pairs.
  3. repartitionAndSortWithinPartitions to perform the secondary sort efficiently.
  4. mapPartitions to write the final structure.
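
A minimal PySpark sketch of those four steps (assuming input lines of the form docID<TAB>text and a hypothetical file name; not the assignment's reference solution):

    from collections import Counter
    from pyspark import SparkContext
    from pyspark.rdd import portable_hash

    sc = SparkContext(appName="InvertedIndex")

    def tokenize(line):
        doc_id, text = line.split("\t", 1)
        return int(doc_id), text.lower().split()

    # 1. Per-document word counts: (docID, {term: freq})
    doc_term_freqs = sc.textFile("docs.txt").map(tokenize).mapValues(Counter)

    # 2. Composite keys: ((term, docID), freq)
    pairs = doc_term_freqs.flatMap(
        lambda kv: (((term, kv[0]), freq) for term, freq in kv[1].items()))

    # 3. Partition by term only, but sort by the full (term, docID) key within partitions
    sorted_pairs = pairs.repartitionAndSortWithinPartitions(
        numPartitions=8, partitionFunc=lambda key: portable_hash(key[0]))

    # 4. Stream each sorted partition and emit one postings list per term
    def build_postings(iterator):
        current_term, postings = None, []
        for (term, doc_id), freq in iterator:
            if current_term is not None and term != current_term:
                yield current_term, postings
                postings = []
            current_term = term
            postings.append((doc_id, freq))
        if current_term is not None:
            yield current_term, postings

    index = sorted_pairs.mapPartitions(build_postings)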

Information Retrieval: Index Compression

Building an index is only half the battle; storing it efficiently is the other. We use compression not just to save disk space, but primarily to ensure the Postings Lists fit into memory for fast searching.

Delta Encoding (Gap Encoding)

Instead of storing absolute Document IDs (which grow indefinitely), we store the gaps (differences) between consecutive IDs.

  • The Concept: Since Postings Lists are sorted, we can represent a sequence [1, 6, 11, 15] as the gaps [1, 5, 5, 4].
  • The Goal: Absolute DocIDs get very large (requiring 4 bytes). Gaps, however, tend to be small integers, which require fewer bits to store.
  • Connection to Zipf's Law:
    • Common Terms: Have dense lists (e.g., appear in every other document). The gaps are tiny (often 1 or 2), making them highly compressible.
    • Rare Terms: Have sparse lists. While individual compression matters less, there are so many rare terms in the tail that compressing them adds up to significant savings.
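
A tiny sketch of gap encoding and decoding (plain Python; the gaps themselves are not yet compressed here):

    def to_gaps(doc_ids):
        # doc_ids must be sorted ascending
        prev, gaps = 0, []
        for d in doc_ids:
            gaps.append(d - prev)
            prev = d
        return gaps

    def from_gaps(gaps):
        doc_ids, total = [], 0
        for g in gaps:
            total += g
            doc_ids.append(total)
        return doc_ids

    print(to_gaps([1, 6, 11, 15]))    # [1, 5, 5, 4]
    print(from_gaps([1, 5, 5, 4]))    # [1, 6, 11, 15]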

Variable-Width Integers (VInt)

Standard integers use a fixed width (usually 4 bytes / 32 bits). Storing a small gap like 5 as 00000000...0101 wastes massive amounts of space. VInts solve this by using only the minimum number of bytes required.

Hadoop's VInt Implementation

Hadoop uses a specific scheme to encode integers in 1 to 5 bytes (covering the same range as a fixed 4-byte int).

  • Small Numbers: If \(x \in [-112, 127]\), it is stored in 1 byte.
  • Large Numbers:

    • The first byte is a "Magic" byte starting with 1000. It encodes the Length (how many following bytes to read) and the Sign.
    • The subsequent bytes contain the actual number.
    • Example: Storing 747 takes 3 bytes (the magic byte plus 2 payload bytes) instead of a fixed 4.

    [Figure: Hadoop VInt encoding example]

    • The "Leftovers": If you use 240 slots for numbers, you have 16 slots left over (specifically, the binary values corresponding to -128 to -113).
    • Purpose: These 16 values are used as headers (or "Magic Bytes") for larger numbers.
    • Structure (1000SLLL): When the computer sees a byte starting with 1000, it knows "This is not a number; it is an instruction."
      • 1000: The identifier prefix.
      • S (Sign): Tells if the number following is positive or negative.
        • Why -x - 1 for negative value? It prevents overflow.

          In standard 32-bit signed integers (Two's Complement), the range is asymmetric:

          • Minimum: −2,147,483,648
          • Maximum: +2,147,483,647

          If you tried to simply store the absolute value (magnitude) of the number (i.e., just -x), you would hit a fatal error with the minimum value:

          • x=−2,147,483,648
          • −x=+2,147,483,648
          • CRASH: This number is too large to fit in a positive signed integer (it exceeds the max by 1).

          By using -x - 1, you shift the value slightly so it fits safely:

          • x=−2,147,483,648
          • −x−1=2,147,483,647
          • SUCCESS: This fits exactly into Integer.MAX_VALUE.
      • LLL (Length): Tells the computer how many extra bytes it needs to read to find the actual number (supports lengths of 1 to 8 bytes).
        • The LLL bits do not store the number directly. Instead, they store the length using a 3-bit two's-complement representation of the negative length.
    • Technically VInt and VLong are exactly the same, since VInt is just a wrapper around VLong.

VLQ (Variable Length Quantity)

Often confused with VInt, this is a more general standard (used in MIDI, Protocol Buffers).

  • Mechanism: It slices the number into 7-bit chunks.
  • The "Continue" Bit: The highest bit (8th bit) of each byte is a flag.
    • 1: "More bytes coming."
    • 0: "This is the last byte."
  • Example: 74 fits in 1 byte (0100 1010). 767 requires 2 bytes.

Step-by-Step Example: Encoding 767

1. Convert to Binary: First, write 767 in standard binary form.

\[ 767 = 1011111111_2 \]

2. Slice into 7-bit Groups (Septets): We need to chop this binary string into chunks of 7 bits, starting from the right (least significant bits).

  • Chunk 1 (Rightmost 7 bits): 1111111
  • Chunk 2 (Remaining bits): 101 (We pad this with zeros to make it 7 bits: 0000101)

So, our two groups are: 0000101 and 1111111.

3. Add the "Continue" Flags Now we add the 8th bit to the front of each group to tell the computer if it should keep reading.

  • Byte 1 (High Order): This is the first byte, but not the last byte. So the flag is 1.
    • Flag (1) + Data (0000101) → 10000101
  • Byte 2 (Low Order): This is the last byte. So the flag is 0.
    • Flag (0) + Data (1111111) → 01111111

4. The Final Result: Putting it all together, 767 requires 2 bytes: 1000 0101 0111 1111.
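
A short sketch of this VLQ scheme (my own Python, not a standard library call):

    def vlq_encode(x):
        # Split into 7-bit groups; set the high bit on every byte except the last
        groups = []
        while True:
            groups.append(x & 0x7F)
            x >>= 7
            if x == 0:
                break
        groups.reverse()                       # most significant group first
        return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

    print(vlq_encode(74).hex())    # 4a   -> 0100 1010 (1 byte)
    print(vlq_encode(767).hex())   # 857f -> 1000 0101 0111 1111 (2 bytes)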

Storage Structures: ArrayWritable vs. BytesWritable

When implementing this in MapReduce, object overhead matters.

  • The Mistake: Using ArrayWritable[VInt]. This stores an array of objects, where each VInt object carries significant Java heap overhead.
  • The Fix: Using BytesWritable. You manually pack the compressed VInts (raw bytes) into a single byte array. This is much denser.

The question of why ArrayWritable[VInt] should be rejected in favor of BytesWritable centers on Java's memory architecture. Using ArrayWritable creates an array of distinct objects, where each compressed integer is wrapped in a heavy object structure with significant metadata overhead, effectively negating the benefits of compression. In contrast, BytesWritable is preferred because it serves as a single container for a contiguous blob of raw bytes, allowing integers to be tightly packed (via WritableUtils.writeVInt) without the massive memory burden of individual object wrappers.

Simple-9 (Bit Packing)

An alternative to byte-aligned compression (like VInt) is word-aligned compression. Simple-9 attempts to pack as many small numbers as possible into a single 32-bit word.

  • Structure:
    • Selector (4 bits): Describes how the remaining bits are divided.
    • Payload (28 bits): Stores the data.
  • Modes: The selector chooses the optimal split for the next batch of numbers:
    • 28 \(\times\) 1-bit numbers.
    • 14 \(\times\) 2-bit numbers.
    • 9 \(\times\) 3-bit numbers (1 bit wasted).
    • 7 \(\times\) 4-bit numbers, etc.
  • Advantage: Very fast decoding because it aligns with CPU instructions, unlike variable-byte stream processing.
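
A rough sketch of the packing logic (assuming every value fits in 28 bits; selector 0 means 28 × 1-bit, selector 8 means 1 × 28-bit):

    SELECTORS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

    def pack_one_word(values, start):
        # Try the densest mode first; fall back until the next chunk fits
        for selector, (count, width) in enumerate(SELECTORS):
            chunk = values[start:start + count]
            if len(chunk) == count and max(chunk) < (1 << width):
                word = selector << 28                 # 4-bit selector in the top bits
                for i, v in enumerate(chunk):
                    word |= v << (i * width)          # values laid into the 28-bit payload
                return word, start + count
        raise ValueError("value does not fit in 28 bits")

    def simple9_pack(values):
        words, i = [], 0
        while i < len(values):
            word, i = pack_one_word(values, i)
            words.append(word)
        return words

    print(len(simple9_pack([1, 1, 1, 5, 5, 7, 200])))   # 7 small gaps fit in 2 words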

Bit-Level Compression Methods

While Simple-9 operates at the Word level (packing values into 32 bits) and VInt operates at the Byte level (using 1-5 bytes), more advanced methods operate at the Bit level. These techniques discard the constraint of byte-alignment to squeeze out maximum efficiency for specific statistical distributions.

Elias \(\gamma\) Code

Elias \(\gamma\) (Gamma) coding is designed for compressing unbounded natural numbers that follow a Power Law distribution (where small numbers are much more common than large ones).

The Algorithm

To encode an integer \(x\):

  1. Calculate Magnitude: Let \(N = \lfloor \log_2 x \rfloor\). This tells us how many bits are strictly necessary to represent \(x\).
  2. Write Prefix: Write \(N\) zeros. This serves as a "unary" header telling the decoder how long the number is.
  3. Write Suffix: Write \(x\) as a standard binary number using \(N+1\) bits; since \(2^N \le x < 2^{N+1}\), this representation always starts with a 1.
    • Example: To encode 9 (\(1001_2\)):
      • \(N = \lfloor \log_2 9 \rfloor = 3\).
      • Write 3 zeros: 000.
      • Write 9 in 4 bits: 1001.
      • Result: 0001001.

Decoding The decoder reads zeros until it hits a 1. If it counts 3 zeros, it knows the number occupies the next \(3+1=4\) bits. It interprets those bits as the integer.

Gamma coding does well for term frequencies, and is OK for gaps too.
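
A small sketch of gamma encode/decode over bit-strings (real implementations work on packed bits, not Python strings):

    def gamma_encode(x):
        # x >= 1: N zeros, then x in N+1 bits (leading bit is always 1)
        n = x.bit_length() - 1        # floor(log2 x)
        return "0" * n + bin(x)[2:]

    def gamma_decode(bits):
        n = bits.index("1")           # count the leading zeros
        return int(bits[n:2 * n + 1], 2)

    print(gamma_encode(9))            # 0001001
    print(gamma_decode("0001001"))    # 9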

Golomb Coding

Golomb coding is a parametric method highly effective for Gap Encoding. Unlike universal codes (like Elias), Golomb allows you to tune the compression based on how frequent a specific term is.

The Concept

Gap sizes follow a specific probability distribution (Poisson). Terms that appear often have small gaps; rare terms have massive gaps. Golomb coding uses a tunable parameter \(M\) to optimize for these different densities.

The Algorithm

It divides the number \(x\) into a Quotient (\(q\)) and a Remainder (\(r\)) using the divisor \(M\).

\(q = \lfloor (x-1) / M \rfloor\)

\(r = x - qM - 1\)

  1. Encode Quotient (\(q\)): Stored in Unary (write \(q\) zeros, then a 1).
  2. Encode Remainder (\(r\)): Stored using Truncated Binary Encoding (a slightly more efficient version of binary for fixed ranges), where we set \(N=M\).
    • Since Golomb uses Truncated Binary specifically to encode the Remainder, the "set of items" we need to encode is the set of possible remainders.

Truncated Binary Encoding

Standard binary is wasteful if the range isn't a power of 2 (e.g., if you only need to store the values 0-14, using 4 bits for everything leaves one of the 16 codewords unused). Truncated binary solves this by using \(k\) bits for smaller numbers and \(k+1\) bits for larger numbers.

The Variables:

  • \(N\): The size of the set (e.g., 15 items: \(\{0, 1, ..., 14\}\)).
  • \(k = \lfloor \log_2 N \rfloor\): The minimum number of bits needed (e.g., \(\lfloor \log_2 15 \rfloor = 3\)).
  • \(u = 2^{k+1} - N\): The "cutoff" point (e.g., \(2^4 - 15 = 1\)).

The Rules:

  1. First \(u\) codewords: Encoded using \(k\) bits (standard binary).
  2. Remaining \(N-u\) codewords: Encoded using \(k+1\) bits, but shifted by adding \(u\) to the value.

Example: Encoding 0-14 (\(N=15\))

  • \(k=3\), \(u=1\).
  • Value 0: First codeword. Use 3 bits (\(k\)). \(\rightarrow\) 000.
  • Value 1: Second codeword (past cutoff). Use 4 bits (\(k+1\)) and add offset \(u\) (\(1+1=2\)). \(2 \rightarrow\) 0010.
  • Value 2: Use 4 bits, add offset (\(2+1=3\)). \(3 \rightarrow\) 0011.
  • Value 14: Use 4 bits, add offset (\(14+1=15\)). \(15 \rightarrow\) 1111.
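
Putting the two pieces together, a sketch of a Golomb encoder (using the course's convention \(q = \lfloor (x-1)/M \rfloor\), \(r = x - qM - 1\) for gaps \(x \ge 1\)):

    def truncated_binary(r, n):
        # Encode r in {0, ..., n-1}: k bits for the first u values, k+1 bits afterwards
        k = n.bit_length() - 1            # floor(log2 n)
        u = (1 << (k + 1)) - n
        if r < u:
            return format(r, f"0{k}b") if k else ""
        return format(r + u, f"0{k + 1}b")

    def golomb_encode(x, m):
        q = (x - 1) // m                  # quotient -> unary (q zeros then a 1)
        r = x - q * m - 1                 # remainder -> truncated binary with N = M
        return "0" * q + "1" + truncated_binary(r, m)

    print(truncated_binary(1, 15))        # 0010 (matches the example above)
    print(golomb_encode(9, 3))            # 00111: q = 2 -> "001", r = 2 -> "11"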

Optimization

The power of Golomb coding is that \(M\) is calculated uniquely for every single term based on the ratio of documents containing that term (\(D_T\)) to the total documents (\(D\)).

\[ M \approx \left\lceil \frac{\log(2-z)}{-\log(1-z)} \right\rceil \text{ where } z = \frac{D_T}{D} \]

This means a common word like "the" uses a different compression scheme than a rare word like "defenestrate".

This is for situations where we only care about “contains” or “doesn’t contain”, not the number of times a term appears in a document.

When \(M\) is a power of 2, \(u = M\), so truncated binary reduces to plain binary; this special case of Golomb coding is called Rice Coding.

Golomb Coding in Production (MapReduce)

Implementing Golomb coding in a distributed system like MapReduce presents a "Chicken and Egg" problem.

  • The Problem: To calculate the parameter \(M\) for a term, we need its Document Frequency (\(D_T\)). However, we only know the total count \(D_T\) after we have finished processing all the documents.
  • The Solution: We must perform a preliminary pass or use a "special key" that sorts first to gather metadata before compressing the actual postings list.

Compressing Frequencies

A real index stores more than just Document IDs; it stores Frequencies (e.g., Term -> [(Doc1, freq:2), (Doc2, freq:1)]).

  • Interleaving: We effectively have two streams of data: DocID Gaps and Frequency Counts.
  • Separation Strategy: It is best to treat these streams independently because they follow different statistical distributions.
    • DocIDs: Use Gap Encoding (Golomb/Gamma).
    • Frequencies: Usually small integers, so a code like Elias \(\gamma\) works well.
  • Alignment: Do not mix byte-aligned compression (VInt) with bit-level compression (Golomb) in the same stream, as it complicates decoding. Encode them separately.

Comparison of Index Compression Methods

| Method | Granularity | Mechanism / Logic | Best Use Case | Pros & Cons |
|---|---|---|---|---|
| VInt (Variable Integer) | Byte-Level | Uses 1 to 5 bytes. Small numbers (-112 to 127) fit in 1 byte; larger numbers use a header byte + payload bytes. | General Purpose. Default for Hadoop MapReduce. Good when strict bit-level optimization isn't required. | Pros: Simple to implement; "good enough" compression (reduced 182 MB \(\rightarrow\) 78 MB in benchmark). Cons: Wastes space for very small numbers compared to bit-level methods (e.g., storing a 1 still takes 8 bits). |
| Simple-9 | Word-Level (32-bit) | Packs as many numbers as possible into a single 32-bit word using a 4-bit selector and 28-bit payload (e.g., 28 \(\times\) 1-bit numbers, or 14 \(\times\) 2-bit numbers). | High-Speed Decoding. Good when CPU speed is prioritized over maximum compression ratio. | Pros: Very fast decoding because it aligns with CPU instructions. Cons: Can waste bits (e.g., 1 bit wasted in 3-bit mode). |
| Elias \(\gamma\) (Gamma) | Bit-Level | Encodes length \(N\) in unary (zeros), followed by the number in binary (\(N+1\) bits). Assumes a Power Law distribution. | Term Frequencies. Excellent for small integers without an upper bound. Also "OK" for gap encoding. | Pros: Significant space savings over VInt (reduced the benchmark index to 44 MB). Cons: Slower than VInt due to bit-manipulation complexity. |
| Golomb | Bit-Level | Parametric (\(M\)). Splits a number into a Quotient (Unary) and a Remainder (Truncated Binary). \(M\) is tuned based on term density. | Gap Encoding (DocIDs). Ideal for geometric distributions (gaps) where \(M\) can be calculated from document frequency. | Pros: Best compression ratio (reduced the benchmark index to ~41 MB). Cons: Complex to implement (requires calculating \(M\) and handling truncated binary). |
| Rice Codes | Bit-Level | A special case of Golomb where \(M\) is a power of 2 (\(2^k\)). | Gap Encoding. Used when you want Golomb efficiency but faster processing. | Pros: Simpler math than generic Golomb because the remainder is just standard binary (no "truncated" logic needed). Cons: Slightly less flexible than generic Golomb. |

Retrieval

Boolean Retrieval

We shift from Offline (building the index) to Online (querying the index).

Hits: the set of documents for which the query evaluates to true.

  • The Model: Boolean Retrieval is the simplest search model. Queries use AND, OR, and NOT to define exact set-based logic.
  • Execution:
    1. Parse: Convert the query (blue AND fish) OR ham into a Syntax Tree.
    2. Fetch: Retrieve the Postings Lists for every term in the tree.
    3. Merge:
      • AND \(\rightarrow\) Set Intersection.
      • OR \(\rightarrow\) Set Union.
      • NOT \(\rightarrow\) Set Negation (or exclusion).
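
For sorted postings lists, the AND merge is the classic two-pointer intersection; a minimal sketch with made-up postings:

    def intersect(p1, p2):
        # AND of two sorted postings lists of docIDs
        i = j = 0
        hits = []
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                hits.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return hits

    # (blue AND fish) OR ham, with OR done as a set union for brevity
    blue, fish, ham = [1, 2, 5, 9], [1, 3, 5], [1, 6]
    print(sorted(set(intersect(blue, fish)) | set(ham)))   # [1, 5, 6]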

Once we have an index, we need an algorithm to process a Boolean query (e.g., (blue AND fish) OR ham). There are two primary approaches to traversing the postings lists.

Term-At-A-Time (TAAT)

This strategy fully processes one term's list before moving to the next.

[Figure: Term-At-A-Time evaluation of (blue AND fish) OR ham]

  • The Workflow:
    1. Process "Blue" & "Fish": It reads the entire list for "blue" and the entire list for "fish". It merges them to create a temporary intermediate result ("blue AND fish").
    2. Process "Ham": It then reads the "ham" list and merges it with that intermediate result to form the final answer.
  • Mechanism: It relies on Accumulators (accumulating scores or boolean states for documents as it sees them).
  • Trade-off:
    • Pros: Efficient for disk access (reads files sequentially one by one).
    • Cons: Memory Intensive. If "blue" and "fish" are common, the intermediate result list can be massive, potentially overflowing memory.

Document-At-A-Time (DAAT)

This strategy processes all terms in parallel, document by document.

[Figure: Document-At-A-Time evaluation of (blue AND fish) OR ham]

  • The Workflow: It opens pointers to all relevant lists ("blue", "fish", "ham") simultaneously.
  • The "Smallest Next Doc" Logic:

    • It looks at the current pointer for every list and picks the smallest Document ID visible (e.g., Doc 1).
    • It evaluates the full query logic for just that document (Does Doc 1 have ham? Yes. Include it).
    • It advances the pointers and repeats for the next smallest ID.
      • You "advance" a list only if it currently points to the document ID you just finished processing.

    Here is the breakdown of why this happens for "fish" and "ham" specifically in this step:

    1. The Situation: The algorithm looks at the current head of every list to find the "Smallest Next Doc."
      • Blue is at 2.
      • Fish is at 1.
      • Ham is at 1.
      • Result: The smallest ID is 1.
    2. The Processing: The algorithm processes Doc ID 1. It checks if Doc 1 matches the query (blue AND fish) OR ham.
      • Since Ham has 1, the query is True. We add Doc 1 to our results.
    3. The "Advance" Step: Now that we are completely finished with Doc ID 1, we need to move past it so we don't look at it again.
      • Fish List: Currently points to 1. Since we just finished Doc 1, we advance this pointer to the next number (2).
      • Ham List: Currently points to 1. We advance this pointer to the next number (3).
      • Blue List: Currently points to 2. Since 2 > 1, this list is already "ahead" of the current document we are processing. We do not advance it; we wait for the other lists to catch up.
    4. Trade-off:
      • Pros: Memory Efficient. It never stores large intermediate lists; it decides if a document is a "hit" instantly and moves on.
      • Cons: Requires managing multiple open file handles and jumping between lists.
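
A sketch of the walkthrough above as code (the postings are made up to mirror the blue/fish/ham example):

    def daat_or_and(blue, fish, ham):
        # Evaluate (blue AND fish) OR ham document-at-a-time over sorted postings
        i = j = k = 0
        hits = []
        while i < len(blue) or j < len(fish) or k < len(ham):
            candidates = []
            if i < len(blue): candidates.append(blue[i])
            if j < len(fish): candidates.append(fish[j])
            if k < len(ham):  candidates.append(ham[k])
            doc = min(candidates)                       # smallest next doc
            in_blue = i < len(blue) and blue[i] == doc
            in_fish = j < len(fish) and fish[j] == doc
            in_ham  = k < len(ham)  and ham[k]  == doc
            if (in_blue and in_fish) or in_ham:
                hits.append(doc)
            if in_blue: i += 1                          # advance only the lists that
            if in_fish: j += 1                          # pointed at the doc we just
            if in_ham:  k += 1                          # finished processing
        return hits

    print(daat_or_and(blue=[2, 3], fish=[1, 2], ham=[1, 3]))   # [1, 2, 3]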

Distributed Partitioning

When the index is too large for one machine, we must split it.

  • Option A: Partition by Term (Global Index):
    • Machine A holds "A-M", Machine B holds "N-Z".
    • Problem: A query like "Apple AND Zebra" requires shuffling data between Machine A and Machine B. This is slow and complex.
  • Option B: Partition by Document (Local Index):
    • Machine A holds all terms for Docs 1-100; Machine B holds all terms for Docs 101-200.
    • Advantage: Each machine can calculate its own "hits" independently. The final step is just concatenating the results.
    • Verdict: "Document indexing is much nicer!" It avoids network shuffles, making it the standard for web search.

In the context of distributed systems, the "shuffle" problem is actually associated with Partitioning by Term (Global Index), not necessarily just the "Term-At-A-Time" retrieval algorithm itself.

Here is why Partitioning by Term forces a network shuffle:

1. The Setup (Global Index)

Imagine you split your index so that:

  • Machine A holds the posting list for "Apple" (and all other words starting with 'A').
  • Machine B holds the posting list for "Zebra" (and all other words starting with 'Z').

2. The Query

A user runs the query: Apple AND Zebra.

3. The Problem (The Shuffle)

To answer this, you need to find the intersection of the "Apple" list and the "Zebra" list.

  • Machine A has one list.
  • Machine B has the other.
  • Neither machine can solve the query alone.
  • The Shuffle: Machine A must send all its data for "Apple" to Machine B (or vice versa) over the network so that one machine has both lists in memory to perform the intersection.

Why this is bad:

In a large search engine, these lists can be millions of entries long. Transferring that much data over the network for every single query kills performance.

The Alternative (Document Partitioning):

If you use Document Partitioning instead, Machine A has both "Apple" and "Zebra" (for docs 1-100), and Machine B has both "Apple" and "Zebra" (for docs 101-200). Both machines can solve the query locally without sending any data to each other.

Ranked Retrieval & Scoring

The Problem with Boolean Retrieval

Boolean retrieval returns an unordered set of "hits," which is insufficient for large collections where users expect results sorted by relevance. Furthermore, strict Boolean logic fails to capture partial matches (e.g., a document containing 3 out of 4 query terms might be very relevant but would be rejected by a strict AND query).

TF-IDF Scoring

To solve this, we move to a "bag of words" model where relevance is calculated using weights.

  • The Logic:
    • Term Frequency (TF): If a term appears frequently in a document, it is important to that document (High Weight).
    • Document Frequency (DF): If a term appears in many documents across the collection (like "the"), it is not useful for distinguishing relevance (Low Weight).
  • The Formula:

    \[ w_{ij} = tf_{ij} \log\left(\frac{N}{df_i}\right) \]
    • \(w_{ij}\): The weight (relevance) of term \(i\) in document \(j\). This is the final score representing how important that specific word is to that specific document.
    • \(tf_{ij}\): The Term Frequency. This is the number of occurrences of term \(i\) in document \(j\) (i.e., how many times the word appears in the document). A higher count means the word is more significant to that document.
    • \(df_{i}\): The Document Frequency. This is the number of documents in the entire collection that contain term \(i\). A higher value means the word is common (like "the" or "is") and therefore less useful for distinguishing relevance.
    • \(N\): The Total Number of Documents in the collection. This constant is used to normalize the inverse document frequency.

    Note: The total relevance of a document \(j\) for a query is calculated by summing the \(w_{ij}\) values for all query terms \(i\) found in that document.
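
A minimal scoring sketch (the counts below are made up; real systems also normalize for document length):

    import math

    def tfidf_weight(tf, df, n_docs):
        # w_ij = tf_ij * log(N / df_i)
        return tf * math.log(n_docs / df)

    def score(query_terms, doc_tf, df, n_docs):
        # Bag-of-words relevance: sum the weights of query terms present in the doc
        return sum(tfidf_weight(doc_tf[t], df[t], n_docs)
                   for t in query_terms if t in doc_tf)

    doc_tf = {"blue": 2, "fish": 1, "the": 7}    # term frequencies in one document
    df = {"blue": 50, "fish": 10, "the": 990}    # document frequencies in the collection
    print(score(["blue", "fish"], doc_tf, df, n_docs=1000))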

Algorithms for Ranked Retrieval

We adapt the previous retrieval strategies to handle scores instead of just True/False matches.

  • Document-At-A-Time (DAAT) for Ranking:
    • Method: Calculate the full score for Doc 1, then Doc 2, etc. Use a min-heap (priority queue) to track only the "Top K" best scores seen so far.
    • Trade-off: Memory efficient (\(O(k)\)), but slow because you generally can't stop early; you must check the whole collection to be sure you didn't miss a high-scoring document. Another pro: it is easy to distribute.
  • Term-At-A-Time (TAAT) for Ranking:

    • Method: Process the rarest term first to initialize accumulators, then update scores as you process other terms.
    • Trade-off: Memory heavy (accumulators for every potential hit), but allows for Early Termination heuristics (stopping if a document clearly won't beat the current top scores).

    Example

    1. The Setup: Start Small (Step 1)

    The algorithm doesn't just process terms left-to-right. It intentionally starts with the rarest term (the one found in the fewest documents).

    • Why? Because the final list of matches can never be larger than the smallest list.
    • Example: Imagine "Fish" is rare (only in 3 docs) and "Blue" is common (in 1,000 docs).
    • Action: We open the list for "Fish" first. We create a "scorecard" (Accumulator) for every document that has "Fish."
      • Accumulator: {Doc1: Score=5, Doc2: Score=8, Doc3: Score=2}.

    2. The Gauntlet: Filter and Update (Step 2)

    Now we take this small list of survivors (Docs 1, 2, 3) and check them against the next term ("Blue"). We do not scan the entire list of 1,000 "Blue" documents. We only check if our survivors contain "Blue".

    • Logic A: The Elimination (Sub-step 1)
      • We check Doc 2. Does it have "Blue"?
      • No? Then it fails the "AND" requirement.
      • Action: Remove it from the accumulator. It is disqualified. (Doc 2 is gone).
    • Logic B: The Bonus (Sub-step 2)
      • We check Doc 1. Does it have "Blue"?
      • Yes? It survives!
      • Action: Calculate the score for "Blue" and add it to the existing score.
      • New Score: Doc1: 5 (Fish) + 3 (Blue) = 8.
  • Hybrid Approach: In production, indices are often partitioned by "Quality" (e.g., Tier 1 = Best Pages). The system runs DAAT on the best partition first and stops if it finds enough good results.
    • This pre-sorting allows the search engine to look only at the "Best" partition first. If it finds enough relevant results there, it can stop early without ever searching the lower-quality "junk".
    • Before a query is even run, documents are assigned a quality tier. Determining whether a document is "good" (high quality) is handled during the indexing phase through a Quality Assessment process.
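
A sketch of the DAAT "Top K via min-heap" idea mentioned above (doc_scores would come from scoring each document as it is processed):

    import heapq

    def top_k(doc_scores, k):
        # doc_scores: iterable of (doc_id, score); keep only the K best seen so far
        heap = []                                  # heap[0] is the weakest of the current top K
        for doc_id, score in doc_scores:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
        return sorted(heap, reverse=True)          # best first

    print(top_k([(1, 2.5), (2, 9.1), (3, 4.0), (4, 7.7)], k=2))   # [(9.1, 2), (7.7, 4)]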

Semantic Search (Vectors)

Standard indexing fails on synonyms (e.g., searching "love" misses "romance").

  • Embeddings: We can use word2vec to create synonym tables or doc2vec to turn documents into vectors.
  • Vector Databases: Tools like Pinecone or Vespa use HNSW (Hierarchical Navigable Small World graphs) indices to perform "Nearest Neighbor" searches on these vectors, finding documents that are conceptually similar rather than just matching keywords.