Module 10: Data Compression¶
Background¶
Data Compression Introduction¶
The problem: How to store and transmit data efficiently?
- Source text: The original data, string \(S\) of characters from the source alphabet \(\Sigma_S\)
- Coded text: The encoded data, string \(C\) of characters from the code alphabet \(\Sigma_C\)
- Encoding: An algorithm mapping source texts to coded texts
- Decoding: An algorithm mapping coded texts back to their original source text
- Source "text" can be any sort of data (not always text!)
- The code alphabet \(\Sigma_C\) is usually \(\{0, 1\}\).
- We consider here only lossless compression: Can recover \(S\) from \(C\) without error.
Judging Encoding Schemes¶
Main objective: Minimize the size of the coded text, measured by the compression ratio:
\(\frac{|C| \cdot \log |\Sigma_C|}{|S| \cdot \log |\Sigma_S|}\)
Examples
- \((73)_{10} \to (1001001)_{2}\)
  Compression ratio: \(\frac{7 \cdot \log 2}{2 \cdot \log 10} \approx 1.05\)
- \((127)_{10} \to (7F)_{16}\)
  Compression ratio: \(\frac{2 \cdot \log 16}{3 \cdot \log 10} \approx 0.8\)
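As a quick check, the ratio can be computed directly. A minimal sketch (the function name compression_ratio is ours):

```python
from math import log2

def compression_ratio(c_len: int, c_alpha: int, s_len: int, s_alpha: int) -> float:
    """(|C| * log|Sigma_C|) / (|S| * log|Sigma_S|), measured in bits."""
    return (c_len * log2(c_alpha)) / (s_len * log2(s_alpha))

print(compression_ratio(7, 2, 2, 10))   # 73 -> (1001001)_2 : ~1.05
print(compression_ratio(2, 16, 3, 10))  # 127 -> (7F)_16    : ~0.80
```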
Efficiency
- Encoding/decoding requires time \(\Omega(|S| + |C|)\), sometimes more.
Other goals (not covered here)
- Reliability (e.g., error-correcting codes)
- Security (e.g., encryption)
Impossibility of Compressing¶
Observation: No lossless encoding scheme can achieve a compression ratio less than 1 for all input strings.
Proof (for \(\Sigma_S = \Sigma_C = \{0, 1\}\)):
- Fix a size \(n\).
- Assume for contradiction that all length-\(n\) bitstrings are compressed to a shorter length (at most \(n-1\)).
- Count both sets:
  - Bitstrings of length \(n\): \(2^n\) elements.
  - Bitstrings of length \(\leq n-1\): \(2^{n-1} + 2^{n-2} + \dots + 2^0 = 2^n - 1 < 2^n\) elements.
By the pigeonhole principle, two distinct inputs must receive the same coded text, so decoding cannot recover both. The assumption is false.
Conclusion:
- Worst-case compression bounds are inherently limited.
- However, real-life data often contains patterns, allowing effective compression.
Detour: Streams (Review)¶
Large texts often exceed memory capacity, so we read \(S\) and write \(C\) through streams.

- Input-stream (\(\sim\) std::cin): reads one character at a time. Uses a stack-like interface: top / pop / is-empty.
- Output-stream (\(\sim\) std::cout): writes one character at a time. Uses a queue-like interface: append.
Advantages
- Enables processing the text while it is still loading.
- Caveat: some algorithms require resetting the input-stream, which not every stream supports.
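A minimal sketch of the two interfaces (class names are ours, not a standard API):

```python
class InputStream:
    """Stack-like view of a text: top / pop / is-empty (reset is optional)."""
    def __init__(self, text: str):
        self._text, self._pos = text, 0
    def top(self) -> str:
        return self._text[self._pos]
    def pop(self) -> str:
        c = self._text[self._pos]
        self._pos += 1
        return c
    def is_empty(self) -> bool:
        return self._pos >= len(self._text)
    def reset(self) -> None:        # not every stream supports this
        self._pos = 0

class OutputStream:
    """Queue-like: append only."""
    def __init__(self):
        self._out = []
    def append(self, c) -> None:
        self._out.append(str(c))
    def contents(self) -> str:
        return "".join(self._out)
```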
Single-Character Encodings¶
Character Encodings¶
A character encoding (or single-character encoding) maps each character in the source alphabet to a string in the code alphabet: \(E : \Sigma_S \to \Sigma_C^*\)
ASCII uses a (length-7) fixed-length code where each codeword \(E(c)\) has the same length.
Encoding/decoding is straightforward by concatenating or decoding the next 7 bits.
Example: APPLE \(\leftrightarrow (65, 80, 80, 76, 69) \leftrightarrow 1000001\ 1010000\ 1010000\ 1001100\ 1000101\)
Earlier fixed-length codes include Caesar shift, Baudot code, and Murray code. Fixed-length codes are simple but do not compress data.
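The APPLE example can be reproduced in a few lines of Python (a sketch of the 7-bit fixed-length code):

```python
S = "APPLE"
C = "".join(format(ord(c), "07b") for c in S)    # concatenate 7-bit codewords
print(C)  # 10000011010000101000010011001000101

# Decoding simply reads 7 bits at a time.
decoded = "".join(chr(int(C[i:i + 7], 2)) for i in range(0, len(C), 7))
assert decoded == S
```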
Variable-Length Codes¶
Better idea: Use variable-length codes.
- Motivation: Some letters in \(\Sigma_S\) occur more frequently than others. Example: Letter frequencies in English text: \(e: 12.70\%\), \(t: 9.06\%\), \(a: 8.17\%\), \(o: 7.51\%\), \(i: 6.97\%\), \(n: 6.75\%\), etc.
Idea: Assign shorter codes to more frequent characters.
Similar to fixed-length codes, map the source alphabet to codewords \(E : \Sigma_S \to \Sigma_C^*\), but:
- Codewords do not need to be of the same length.
- This reduces the overall size of the encoded text.
Encoding¶
Assume a character encoding \(E : \Sigma_S \to \Sigma_C^*\), where \(E\) is a dictionary with keys in \(\Sigma_S\).
**singleChar::encoding(E, S, C)**
E: the encoding dictionary
S: input-stream with characters in Σ_S
C: output-stream of bits
while S is non-empty
    w ← E.search(S.pop())
    append each bit of w to C
Decoding¶
The decoding algorithm must map \(\Sigma_C^*\) to \(\Sigma_S^*\).
- The code must be lossless, meaning it is uniquely decodable.
- (Morse code uses pauses as end-sentinels to avoid ambiguity.)
From now on, we only consider prefix-free codes \(E\):
- No codeword is a prefix of another.
- This corresponds to a trie where characters of \(\Sigma_S\) only appear at the leaves.
Prefix-free codes eliminate the need for an end-sentinel \(\$\).
Decoding of Prefix-Free Codes¶
Any prefix-free code is uniquely decodable.
**prefixFree::decoding(C, S, T)**
C: input-stream with characters in Σ_C
S: output-stream
T: encoding trie
while C is non-empty
    z ← T.root
    while z is not a leaf
        if C is empty or z has no child labelled C.top()
            return "invalid encoding"
        z ← child of z that is labelled with C.pop()
    S.append(character stored at z)
Run-time: \(O(|C|)\)
Encoding from Trie¶
**prefixFree::encoding(S, C, T)**
S: input-stream with characters in Σ_S
C: output-stream
T: encoding trie
E ← array of trie-nodes indexed by Σ_S
for all leaves ℓ in T do E[character at ℓ] ← ℓ
while S is non-empty
    w ← empty bitstring
    v ← E[S.pop()]
    while v is not the root
        w.prepend(label of the edge from v's parent to v)
        v ← parent of v
    append each bit of w to C
- Run-time: \(O(|T| + |C|)\)
- Assumes all interior nodes of \(T\) have two children; otherwise, the encoding scheme can be improved.
- Therefore, \(|T| \leq 2|\Sigma_S| - 1\), leading to a run-time of \(O(|\Sigma_S| + |C|)\).
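Both routines in runnable form, representing the trie as nested dicts keyed by bit; the example code \(E(a) = 0\), \(E(n) = 10\), \(E(b) = 110\), \(E(\$) = 111\) is our own:

```python
# Internal nodes are dicts keyed by '0'/'1'; leaves are source characters.
trie = {"0": "a", "1": {"0": "n", "1": {"0": "b", "1": "$"}}}

def codeword_table(node, prefix=""):
    """Walk the trie once; each leaf yields its codeword (the array E above)."""
    if isinstance(node, str):
        return {node: prefix}
    table = {}
    for bit, child in node.items():
        table.update(codeword_table(child, prefix + bit))
    return table

def encode(s: str) -> str:
    E = codeword_table(trie)
    return "".join(E[c] for c in s)

def decode(c: str) -> str:
    out, node = [], trie
    for bit in c:
        node = node[bit]            # missing child raises KeyError: invalid encoding
        if isinstance(node, str):   # reached a leaf: emit character, restart at root
            out.append(node)
            node = trie
    if node is not trie:
        raise ValueError("invalid encoding")  # stream ended mid-codeword
    return "".join(out)

assert decode(encode("banana$")) == "banana$"
```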
Huffman’s Encoding Trie¶
Huffman's Algorithm: Building the Best Trie¶
Objective: Determine the "best" trie for a given source text \(S\).
- Idea: Frequent characters should have short codewords. Infrequent characters should be "far down" in the trie.
Greedy Algorithm
- Initialization: Start with one height-0 trie (a single leaf) for each character \(c \in \Sigma_S\).
- Weight: Each trie has a weight equal to the sum of frequencies of its characters (pre-computed).
- Merge Tries: Find the two tries with the minimum weights and merge them.
- This adds one bit to the codeword of every character in the two merged tries.
- Repeat: Continue merging until only one trie remains.
Efficiency:
- Use a min-ordered heap for storage.
- Each merge requires two delete-min operations and one insert.
Huffman's Algorithm: Pseudocode¶
**Huffman::encoding(S, C)**
S: input-stream with characters in Σ_S
C: output-stream
f ← array indexed by Σ_S, initially all-0 // frequencies
while S is non-empty do increase f[S.pop()] by 1
Q ← min-oriented priority queue that stores tries // initialize PQ
for all c ∈ Σ_S with f[c] > 0 do Q.insert(c, f[c])
while Q.size() > 1 do // build decoding trie
    (T1, f1) ← Q.delete-min()
    (T2, f2) ← Q.delete-min()
    Q.insert(trie with T1, T2 as subtries, f1 + f2)
T ← Q.delete-min()
C.append(encoding trie T) // send trie
reset input-stream S
prefixFree::encoding(S, C, T) // actual encoding
Note: This assumes \(S\) has at least two distinct characters.
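The same construction in runnable form, using Python's heapq as the min-oriented priority queue (a sketch; ties are broken by insertion order, so the resulting trie is just one of several optimal ones):

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(s: str) -> dict:
    """Optimal prefix-free code for s (assumes >= 2 distinct characters)."""
    tick = count()  # unique tie-breaker so the heap never compares tries
    # Heap entries: (weight, tie, trie); a trie is a char or a [left, right] pair.
    Q = [(f, next(tick), c) for c, f in Counter(s).items()]
    heapq.heapify(Q)
    while len(Q) > 1:
        f1, _, T1 = heapq.heappop(Q)   # two tries of minimum weight...
        f2, _, T2 = heapq.heappop(Q)
        heapq.heappush(Q, (f1 + f2, next(tick), [T1, T2]))  # ...merged
    codes = {}
    def walk(node, prefix):            # read the codewords off the trie
        if isinstance(node, str):
            codes[node] = prefix
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(Q[0][2], "")
    return codes

codes = huffman_code("LOSSLESS")          # frequencies: S:4, L:2, O:1, E:1
assert len(codes["S"]) < len(codes["O"])  # frequent chars get shorter codewords
```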
Huffman Coding Discussion¶
Optimality: The constructed trie is optimal, producing the shortest \(C\) among prefix-free single-character encodings with \(\Sigma_C = \{0, 1\}\).
Limitations
- Constructed trie is not unique unless tie-breaking rules are provided.
- Decoder does not know the trie:
  - Must either send the decoding trie (requires converting the trie to a bitstring),
  - or send the character frequencies plus tie-breaking rules.
  - Both methods increase the length of the encoded text.
- Encoding requires two passes:
- First to compute frequencies.
- Second to encode.
- Cannot use a stream unless it supports resetting.
Run-time
- Encoding: \(O(|\Sigma_S| \log |\Sigma_S| + |C|)\)
- Decoding: \(O(|C|)\)
There are many variations, such as handling unused characters, frequency estimation, or adaptive encoding adjustments.
Run-Length Encoding¶
Run-Length Encoding Idea¶
- Definition: A form of multi-character encoding where multiple source-text characters are represented by a single code-word.
- Requirements:
- Input must be a bitstring \((\Sigma_S = \{0, 1\})\).
- The decoding dictionary is implicit and uniquely defined.
- Use Case: Effective when the input \(S\) has long runs of the same character, e.g., \(00000 \; 111 \; 0000000\).
- Encoding Process:
- Send the leftmost bit of \(S\) (either 0 or 1).
- Follow with a sequence of integers representing run lengths.
- Example for \(00000 \; 111 \; 0000000\): \(0, 5, 3, 7\) (no need to specify bits as runs alternate).
Detour: Encoding Positive Integers¶
Encoding Techniques:
- Base-2 Encoding: \(E_{b2}(k)\) is the string \(w\) such that \((w)_2 = k\).
- Bijective Base-2 Encoding: \(E_{bb2}(k) := E_{b2}(k+1)\) with the leading 1 removed.
  (Bijective base-26 encoding is used to enumerate spreadsheet columns.)

For encoding a sequence of integers, a separator-symbol (e.g., #) is used:
- Example 1: \(5, 3, 7 \to 101\#11\#111\)
  Encoding length: \(|C| \log |\Sigma_C| = 10 \log 3 \approx 15.8\)
- Example 2 (bijective, writing \(A = 0\), \(B = 1\)): \(5, 3, 7 \to BA\#AA\#AAA\)
  Encoding length: \(|C| \log |\Sigma_C| = 9 \log 3 \approx 14.26\)
Run-Length Encoding Details¶
Elias Gamma Code (for an integer \(k \geq 1\)):
- To encode \(k\):
  - Take the binary representation \(E_{b2}(k)\) of \(k\).
  - Prepend \(\lfloor \log k \rfloor\) zeros, so that the initial \(1\) sits exactly in the middle.
  - Example: \(E_\gamma(5) = 00\,101\).

Definition: To perform run-length encoding:
- Write the initial bit to the output.
- Determine the run lengths \(k_1, k_2, \ldots, k_d\).
- Write the concatenation \(E_{\gamma}(k_1)\, E_{\gamma}(k_2) \cdots E_{\gamma}(k_d)\) to the output.

Note: Steps 2 and 3 can be done simultaneously.
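The three integer codes of this detour in runnable form (function names are ours):

```python
def E_b2(k: int) -> str:
    """Base-2: the binary representation of k (k >= 1)."""
    return bin(k)[2:]

def E_bb2(k: int) -> str:
    """Bijective base-2: E_b2(k + 1) with the leading 1 removed."""
    return bin(k + 1)[3:]

def E_gamma(k: int) -> str:
    """Elias gamma: floor(log2 k) zeros, then E_b2(k)."""
    b = bin(k)[2:]
    return "0" * (len(b) - 1) + b   # len(b) - 1 == floor(log2 k)

assert E_b2(5) == "101"
assert E_bb2(5) == "10"        # written as BA above, with A = 0, B = 1
assert E_gamma(5) == "00101"
```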
RLE Encoding¶
**RLE::encoding(S, C)**
S: input-stream of bits
C: output-stream
b ← S.top(); C.append(b)
while S is non-empty do
    k ← 1; S.pop() // k: length of the current run
    while S is non-empty and S.top() = b do
        k++; S.pop()
    w ← empty string // w: binary encoding of k
    while k > 1
        C.append(0) // write the leading zeroes
        w.prepend(k mod 2)
        k ← ⌊k/2⌋
    w.prepend(1)
    append each bit of w to C
    b ← 1 − b
Claim: A sequence of Elias gamma codes can be decoded uniquely.
- Each Elias gamma code has the form \(0^\ell 1 \ast^\ell\) (where ‘∗’ means ‘some bit’).
- Can determine \(\ell\) by scanning until we encounter a 1.
- Convert this 1 plus the next \(\ell\) bits into the integer.
**RLE::decoding(C, S)**
C: input-stream of bits
S: output-stream
b ← C.pop() // b: bit of the current run
while C is non-empty
    ℓ ← 0 // ℓ: number of leading zeroes
    while C.pop() = 0 do ℓ++
    k ← 1 // k: integer that was encoded
    for j ← 1 to ℓ do k ← k * 2 + C.pop()
    for j ← 1 to k do S.append(b)
    b ← 1 − b
If C.pop() is called when there are no bits left, then C was not valid input.
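Both routines in runnable form (a sketch following the pseudocode; a Python iterator stands in for the bit-streams):

```python
import itertools

def rle_encode(s: str) -> str:
    """Run-length encode a bitstring: initial bit, then Elias gamma codes."""
    out = [s[0]]
    for bit, run in itertools.groupby(s):   # runs alternate, so bits are implicit
        b = bin(len(list(run)))[2:]         # E_b2 of the run length
        out.append("0" * (len(b) - 1) + b)  # Elias gamma code
    return "".join(out)

def rle_decode(c: str) -> str:
    bits = iter(c)
    b = next(bits)                 # bit of the first run
    out = []
    for first in bits:
        ell = 0
        while first == "0":        # count leading zeroes
            ell += 1
            first = next(bits)     # StopIteration here means invalid input
        k = 1
        for _ in range(ell):       # read the remaining ell bits of the length
            k = 2 * k + int(next(bits))
        out.append(b * k)
        b = "1" if b == "0" else "0"
    return "".join(out)

s = "00000" "111" "0000000"
assert rle_encode(s) == "0" + "00101" + "011" + "00111"
assert rle_decode(rle_encode(s)) == s
```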
RLE Discussion¶
- An all-0 string of length \(n\) would be compressed to \(0E_\gamma(n)\), which has \(2 \lfloor \log n \rfloor + 2 \in o(n)\) bits.
- Usually, we are not that lucky:
  - A run of length \(k\) costs \(2\lfloor \log k \rfloor + 1\) bits, so there is no compression until run-length \(k \geq 6\).
  - Expansion when run-length \(k = 2\) or \(4\).
- RLE was popular for fax machines.
- RLE can be adapted to larger alphabet sizes, but then the encoding of each run must also store the character. Used in some image formats (e.g., TIFF).
- The method can be adapted to encode only runs of 0s. Used inside bzip2 (see later).
Lempel-Ziv-Welch¶
Longer Patterns in Input¶
Huffman and RLE take advantage of frequent/repeated single characters.
Observation:
Certain substrings are much more frequent than others.
- English text:
- Most frequent digraphs: TH, ER, ON, AN, RE, HE, IN, ED, ND, HA
- Most frequent trigraphs: THE, AND, THA, ENT, ION, TIO, FOR, NDE
- HTML: <a href=, <img src=
- Video: repeated background between frames, shifted sub-image
Ingredient 1 for Lempel-Ziv-Welch compression: take advantage of such substrings without needing to know beforehand what they are.
Adaptive Dictionaries¶
- Fixed Dictionaries:
- ASCII, UTF-8, and RLE use fixed dictionaries.
- Static Dictionary:
- Huffman uses a static dictionary, which remains the same for the entire encoding/decoding process.
- Ingredient 2 for LZW: Adaptive Encoding
- Starts with a fixed initial dictionary \(D_0\) (usually ASCII).
- For \(i \geq 0\), \(D_i\) is used to determine the \(i\)-th output character.
- After writing the \(i\)-th character, the encoder updates \(D_i\) to \(D_{i+1}\).
- Challenge:
- Decoder must be able to reconstruct how the encoder updated the dictionary from the coded text.
LZW Overview¶
- Objective: Convert \(S\) into a list of code-numbers.
- Initial Setup:
- Start with dictionary \(D_0\) storing ASCII characters (code-numbers \(0, \ldots, 127\)).
- Each step adds a multi-character string to the dictionary with code-numbers \(128, 129, \ldots\).
- Encoding Process (repeated until \(S\) is exhausted; see the sketch below):
  - Find the longest string \(w\) in \(D_i\) that is a prefix of the remaining input, and output its code-number.
  - Add a useful substring to \(D_i\): insert \(wc\), where \(c\) is the character following \(w\) in \(S\).
- Dictionary Data Structure:
  - Use a trie to match characters efficiently.
  - To find \(w\): walk down the trie until reaching a node \(z\) where no child matches the next character \(c\).
  - To update \(D_i\): add a child to \(z\) with the link labelled \(c\).
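A runnable sketch of the encoder (a plain dict of strings stands in for the trie; fine for illustration, though the trie avoids rescanning \(w\)):

```python
def lzw_encode(s: str) -> list:
    """LZW: return code-numbers; dictionary starts as ASCII (codes 0..127)."""
    D = {chr(i): i for i in range(128)}
    idx = 128
    out, i = [], 0
    while i < len(s):
        w = s[i]                                  # longest prefix of s[i:] in D
        while i + len(w) < len(s) and w + s[i + len(w)] in D:
            w += s[i + len(w)]
        out.append(D[w])
        if i + len(w) < len(s):                   # add w + next character to D
            D[w + s[i + len(w)]] = idx
            idx += 1
        i += len(w)
    return out

assert lzw_encode("YO!YO!YO!") == [89, 79, 33, 128, 130, 128]
```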
LZW decoding pseudocode¶
**LZW::decoding(C, S)**
C: input-stream of integers
S: output-stream
D ← dictionary that maps {0,...,127} to ASCII
idx ← 128
k ← C.pop(); s ← D.search(k); S.append(s)
while there are more codes in C do
    s_prev ← s; k ← C.pop()
    if k < idx then s ← D.search(k)
    else if k = idx then s ← s_prev ∘ s_prev[0] // special situation
    else return FAIL // invalid encoding!
    append each character of s to S
    D.insert(idx, s_prev ∘ s[0])
    idx++
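The same decoder in runnable form, including the special situation (a sketch):

```python
def lzw_decode(codes: list) -> str:
    D = {i: chr(i) for i in range(128)}
    idx = 128
    s = D[codes[0]]
    out = [s]
    for k in codes[1:]:
        s_prev = s
        if k < idx:
            s = D[k]
        elif k == idx:                 # special situation: code not yet in D
            s = s_prev + s_prev[0]
        else:
            raise ValueError("invalid encoding")
        out.append(s)
        D[idx] = s_prev + s[0]         # mirror the encoder's dictionary update
        idx += 1
    return "".join(out)

assert lzw_decode([89, 79, 33, 128, 130, 128]) == "YO!YO!YO!"
```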
Lempel-Ziv-Welch decoding details¶
- Dictionary \(D\) maps consecutive integers to words. Use an array!
- Run-time: \(O(|S|)\)
- \(\Theta(|s|)\) each round to append \(s\) to output.
- Everything else takes no longer.
- Dictionary wastes space: Words may get long
- To save space, store string as code of prefix + one character.
- Can still look up \(s\) in \(O(|s|)\) time.
- So run-time remains \(O(|S|)\).
**LZW::dictionary-lookup(D, k)**
D: array; D[k] is an ASCII character for k ≤ 127, and a pair (prefix-code, character) otherwise
s ← empty word
while k > 127 do (k, c) ← D[k]; s.prepend(c)
s.prepend(D[k])
return s
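The compact lookup in runnable form (here D is a Python list holding the entries for codes 128, 129, ...; the example entries, which spell AND, are ours):

```python
def lookup(D: list, k: int) -> str:
    """D[k - 128] = (prefix code, last character); codes < 128 are ASCII."""
    s = []
    while k > 127:
        k, c = D[k - 128]
        s.append(c)                  # collect characters back to front
    s.append(chr(k))
    return "".join(reversed(s))

D = [(65, "N"), (128, "D")]          # code 128 = "AN", code 129 = "AND"
assert lookup(D, 129) == "AND"
```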
Lempel-Ziv-Welch discussion¶
- Encoding and decoding take \(O(|S|)\) time (assuming constant alphabet).
- Efficiency:
- Encoding and decoding need to go through the string only once.
- This allows compression to be performed while streaming the text.
- Conversion to bitstring:
- So far, encoding used numbers. Convert to bitstring using fixed-length encoding.
- Example: Use length-12 encoding, e.g., \(129 = 000010000001\).
- Stop adding to dictionary \(D\) once reaching code-number \(2^{12} - 1 = 4095\).
- Compression efficiency:
- Achieves approximately 45% compression on English text.
Combining Compression Schemes: bzip2¶
bzip2 overview¶
Idea: Combine multiple compression schemes!
The key ingredient is to use text transforms: change input into a different text that compresses better.
Steps
- Input Text: \(T_0\)
- Starting point.
- Burrows-Wheeler Transform: \(T_0 \to T_1\)
- \(T_1\) is a permutation of \(T_0\).
- If \(T_0\) has repeated substrings, then \(T_1\) will have long runs of characters.
- Move-to-Front Transform: \(T_1 \to T_2\)
- \(T_2\) uses ASCII-numbers.
- If \(T_1\) has long runs of characters, then \(T_2\) will have:
- Long runs of zeros.
- Skewed frequencies (some numbers appear more often than others).
- Modified Run-Length Encoding (RLE): \(T_2 \to T_3\)
- If \(T_2\) has long runs of zeros, then \(T_3\) is shorter.
- Skewed frequencies remain.
- Huffman Encoding: \(T_3 \to T_4\)
- Compresses well because the frequencies are skewed.
- Final compressed output.
Each step in the transformation makes the text more suitable for compression, producing smaller outputs.
bzip2 — The Easy Steps¶
Move-to-Front Transform¶
- The dictionary \(D : \{0, \dots, 127\} \to \text{ASCII}\) is stored as an unordered array and maintained with the move-to-front heuristic.
- Character \(c\) gets mapped to index \(i\) with \(D[i] = c\).
- A character in \(S\) repeats \(k\) times \(\iff C\) has a run of \(k-1\) zeroes.
- We would therefore expect lots of small numbers in the output (see the sketch below).
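A runnable sketch of the transform and its inverse:

```python
def mtf_encode(s: str) -> list:
    """Replace each character by its current index, then move it to the front."""
    D = [chr(i) for i in range(128)]
    out = []
    for c in s:
        i = D.index(c)
        out.append(i)
        D.insert(0, D.pop(i))        # move-to-front
    return out

def mtf_decode(codes: list) -> str:
    D = [chr(i) for i in range(128)]
    out = []
    for i in codes:
        out.append(D[i])
        D.insert(0, D.pop(i))        # same update keeps the dictionaries in sync
    return "".join(out)

assert mtf_encode("aaaa")[1:] == [0, 0, 0]   # a repeat yields a run of zeroes
assert mtf_decode(mtf_encode("banana")) == "banana"
```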
Modified Run-Length Encoding¶
- Input is a list of "characters" in \(\{0, \dots, 127\}\).
- Encode only the runs of 0s (other numbers unchanged).
- Encode run-length \(k\) with bijective base-2 encoding (shorter than the gamma code), using two new characters \(A, B\); since \(A\) and \(B\) occur nowhere else, the run's encoding is self-delimiting.
Huffman Encoding¶
- Exactly as seen before.
Burrows-Wheeler Transform¶
Burrows-Wheeler Transform (BWT)¶
Given
- Source text \(S\) as an array with an end-sentinel (e.g., '$').
Steps
- Write Down Cyclic Shifts:
  - Generate all cyclic shifts of the string \(S\); the \(i\)-th cyclic shift moves the first \(i\) characters to the back.
- Sort Lexicographically:
  - Arrange all cyclic shifts in lexicographic order using MSD radix sort.
  - \(\Theta(n)\) strings of length \(\Theta(n) \Rightarrow \Theta(n^2)\) worst-case time.
- Extract Rightmost Column:
  - Take the last character of each sorted cyclic shift to form the BWT result.
Properties
- The Burrows-Wheeler Transform (BWT) consists of the last characters of the cyclic shifts of \(S\) after lexicographical sorting.
- Observations:
- The result of the BWT is a permutation of \(S\).
- Repeated substrings in \(S\) cause runs of repeated characters in the BWT result, which is helpful for compression.
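The quadratic construction is only a few lines in Python (a sketch; sorted() is a comparison sort rather than MSD radix sort, and we rely on '$' comparing smaller than all letters, which holds in ASCII):

```python
def bwt_naive(s: str) -> str:
    """s must end with the sentinel '$'."""
    shifts = sorted(s[i:] + s[:i] for i in range(len(s)))  # sorted cyclic shifts
    return "".join(shift[-1] for shift in shifts)          # rightmost column

assert bwt_naive("banana$") == "annb$aa"
```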
Fast Burrows-Wheeler Encoding¶
Input:
- \(S\) with an end-sentinel ($).
Steps:
- Compute all cyclic shifts of \(S\).
- Sort cyclic shifts lexicographically to form the sorted order.
- Extract the last column (rightmost character of each sorted cyclic shift).
Observations:
- Because of the end-sentinel, sorting the cyclic shifts yields the same permutation as sorting the suffixes of \(S\).
- We can therefore use the suffix array \(A^S\), which can be computed in \(O(n \log n)\) time.
BWT Encoding:
The BWT is read off directly from the suffix array: \(C[i] = S[A^S[i] - 1]\), with the index taken cyclically (when \(A^S[i] = 0\), \(C[i]\) is the last character of \(S\), the sentinel \$).
Notes:
- Suffix Array (\(A^S\)): stores the starting indices of the suffixes of \(S\) in sorted order.
- This faster approach avoids explicitly constructing all cyclic shifts, saving time and space.
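The read-off in code (a sketch; the suffix array is built naively here, not in \(O(n \log n)\)):

```python
def bwt_from_suffix_array(s: str) -> str:
    """C[i] = S[A[i] - 1], taken cyclically: A[i] = 0 wraps to the sentinel."""
    A = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
    return "".join(s[i - 1] for i in A)             # Python's s[-1] does the wrap

assert bwt_from_suffix_array("banana$") == "annb$aa"
```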
BWT Decoding Code¶
**BWT::decoding(C, S)**
C: string of characters, one of which (not necessarily the last one) is $
S: output-stream
Initialize array A // leftmost column
for all indices i of C
    A[i] ← (C[i], i) // store character and index
stably sort A by first component
for all indices j of C // where is the $-char?
    if C[j] = $ then break
do
    S.append(character stored in A[j]) // extract source text
    j ← index stored in A[j]
while appended character is not $
BWT and bzip2 Discussion¶
BWT Costs:
- BWT Encoding Cost: \(O(n \log n)\)
  - Dominated by building the suffix array; reading off the encoding then takes \(O(n)\).
- BWT Decoding Cost: \(O(n + |\Sigma_S|)\)
  - Faster than encoding (a single stable counting sort plus one pass).
- Both encoding and decoding use \(O(n)\) space.
Characteristics:
- Requires all of the text (no streaming is possible).
- BWT (and hence bzip2) is a block compression method, compressing one block at a time.
bzip2 Costs:
- bzip2 Encoding Cost: \(O(n (\log n + |\Sigma|))\) with a large constant.
- bzip2 Decoding Cost: \(O(n |\Sigma|)\) with a large constant.
General Observations:
- bzip2 is slower than many other compression methods.
- However, it achieves better compression due to the transformations (like BWT).