Module 10: Data Compression¶
Background¶
Data Compression Introduction¶
The problem: How to store and transmit data efficiently?
- Source text: The original data, string \(S\) of characters from the source alphabet \(\Sigma_S\)
- Coded text: The encoded data, string \(C\) of characters from the code alphabet \(\Sigma_C\)
- Encoding: An algorithm mapping source texts to coded texts
- Decoding: An algorithm mapping coded texts back to their original source text
- Source "text" can be any sort of data (not always text!)
- The code alphabet \(\Sigma_C\) is usually \(\{0, 1\}\).
- We consider here only lossless compression: Can recover \(S\) from \(C\) without error.
Judging Encoding Schemes¶
Main objective: Minimize the size of the coded text, measured by the compression ratio:
\(\frac{|C| \cdot \log |\Sigma_C|}{|S| \cdot \log |\Sigma_S|}\)
Examples
- \((73)_{10} \to (1001001)_{2}\)
  Compression ratio: \(\frac{7 \cdot \log 2}{2 \cdot \log 10} \approx 1.05\)
- \((127)_{10} \to (7F)_{16}\)
  Compression ratio: \(\frac{2 \cdot \log 16}{3 \cdot \log 10} \approx 0.8\)
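As a quick check, the ratio can be computed directly. A minimal sketch (the function name compression_ratio is ours):

```python
from math import log2

def compression_ratio(c_len: int, c_alpha: int, s_len: int, s_alpha: int) -> float:
    """(|C| * log|Sigma_C|) / (|S| * log|Sigma_S|), measured in bits."""
    return (c_len * log2(c_alpha)) / (s_len * log2(s_alpha))

print(compression_ratio(7, 2, 2, 10))   # 73 -> (1001001)_2 : ~1.05
print(compression_ratio(2, 16, 3, 10))  # 127 -> (7F)_16    : ~0.80
```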
Efficiency
- Encoding/decoding requires time \(\Omega(|S| + |C|)\), sometimes more.
Other goals (not covered here)
- Reliability (e.g., error-correcting codes)
- Security (e.g., encryption)
Impossibility of Compressing¶
Observation: No lossless encoding scheme can achieve a compression ratio less than 1 for all input strings.
Proof (for \(\Sigma_S = \Sigma_C = \{0, 1\}\)):
- Fix a size \(n\).
- Assume for contradiction that all length-\(n\) bitstrings are compressed to a shorter length (at most \(n-1\)).
- Count both sets:
  - Bitstrings of length \(n\): \(2^n\) elements.
  - Bitstrings of length \(\leq n-1\): \(2^{n-1} + 2^{n-2} + \dots + 2^0 = 2^n - 1 < 2^n\) elements.
By the pigeonhole principle, two distinct inputs must receive the same coded text, so decoding cannot recover both. The assumption is false.
Conclusion:
- Worst-case compression bounds are inherently limited.
- However, real-life data often contains patterns, allowing effective compression.
Detour: Streams (Review)¶
Large texts often exceed memory capacity, so we read \(S\) and write \(C\) through streams.

- Input-stream (\(\sim\) std::cin): reads one character at a time. Uses a stack-like interface: top / pop / is-empty.
- Output-stream (\(\sim\) std::cout): writes one character at a time. Uses a queue-like interface: append.
Advantages
- Enables processing the text while it is still loading.
- Caveat: some algorithms require resetting the input-stream, which not every stream supports.
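A minimal sketch of the two interfaces (class names are ours, not a standard API):

```python
class InputStream:
    """Stack-like view of a text: top / pop / is-empty (reset is optional)."""
    def __init__(self, text: str):
        self._text, self._pos = text, 0
    def top(self) -> str:
        return self._text[self._pos]
    def pop(self) -> str:
        c = self._text[self._pos]
        self._pos += 1
        return c
    def is_empty(self) -> bool:
        return self._pos >= len(self._text)
    def reset(self) -> None:        # not every stream supports this
        self._pos = 0

class OutputStream:
    """Queue-like: append only."""
    def __init__(self):
        self._out = []
    def append(self, c) -> None:
        self._out.append(str(c))
    def contents(self) -> str:
        return "".join(self._out)
```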
Single-Character Encodings¶
Character Encodings¶
A character encoding (or single-character encoding) maps each character in the source alphabet to a string in the code alphabet: \(E : \Sigma_S \to \Sigma_C^*\)
ASCII uses a (length-7) fixed-length code where each codeword \(E(c)\) has the same length.
Encoding/decoding is straightforward by concatenating or decoding the next 7 bits.
Example: APPLE \(\leftrightarrow (65, 80, 80, 76, 69) \leftrightarrow 1000001\ 1010000\ 1010000\ 1001100\ 1000101\)
Earlier fixed-length codes include Caesar shift, Baudot code, and Murray code. Fixed-length codes are simple but do not compress data.
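The APPLE example can be reproduced in a few lines of Python (a sketch of the 7-bit fixed-length code):

```python
S = "APPLE"
C = "".join(format(ord(c), "07b") for c in S)    # concatenate 7-bit codewords
print(C)  # 10000011010000101000010011001000101

# Decoding simply reads 7 bits at a time.
decoded = "".join(chr(int(C[i:i + 7], 2)) for i in range(0, len(C), 7))
assert decoded == S
```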
Variable-Length Codes¶
Better idea: Use variable-length codes.
- Motivation: Some letters in \(\Sigma_S\) occur more frequently than others. Example: Letter frequencies in English text: \(e: 12.70\%\), \(t: 9.06\%\), \(a: 8.17\%\), \(o: 7.51\%\), \(i: 6.97\%\), \(n: 6.75\%\), etc.
Idea: Assign shorter codes to more frequent characters.
Similar to fixed-length codes, map the source alphabet to codewords \(E : \Sigma_S \to \Sigma_C^*\), but:
- Codewords do not need to be of the same length.
- This reduces the overall size of the encoded text.
Encoding¶
Assume a character encoding \(E : \Sigma_S \to \Sigma_C^*\), where \(E\) is a dictionary with keys in \(\Sigma_S\).
**singleChar::encoding(E, S, C)**
E: the encoding dictionary
S: input-stream with characters in Σ_S
C: output-stream of bits
while S is non-empty
    w ← E.search(S.pop())
    append each bit of w to C
Decoding¶
The decoding algorithm must map \(\Sigma_C^*\) to \(\Sigma_S^*\).
- The code must be lossless, meaning it is uniquely decodable.
- (Morse code uses pauses as end-sentinels to avoid ambiguity.)
From now on, we only consider prefix-free codes \(E\):
- No codeword is a prefix of another.
- This corresponds to a trie where characters of \(\Sigma_S\) only appear at the leaves.
Prefix-free codes eliminate the need for an end-sentinel \(\$\).
Decoding of Prefix-Free Codes¶
Any prefix-free code is uniquely decodable.
**prefixFree::decoding(C, S, T)**
C: input-stream with characters in Σ_C
S: output-stream
T: encoding trie
while C is non-empty
    z ← T.root
    while z is not a leaf
        if C is empty or z has no child labelled C.top()
            return "invalid encoding"
        z ← child of z that is labelled with C.pop()
    S.append(character stored at z)
Run-time: \(O(|C|)\)
Encoding from Trie¶
**prefixFree::encoding(S, C, T)**
S: input-stream with characters in Σ_S
C: output-stream
T: encoding trie
E ← array of trie-nodes indexed by Σ_S
for all leaves ℓ in T do E[character at ℓ] ← ℓ
while S is non-empty
    w ← empty bitstring
    v ← E[S.pop()]
    while v is not the root
        w.prepend(label of the edge from v's parent to v)
        v ← parent of v
    append each bit of w to C
- Run-time: \(O(|T| + |C|)\)
- Assumes all interior nodes of \(T\) have two children; otherwise, the encoding scheme can be improved.
- Therefore, \(|T| \leq 2|\Sigma_S| - 1\), leading to a run-time of \(O(|\Sigma_S| + |C|)\).
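Both routines in runnable form, representing the trie as nested dicts keyed by bit; the example code \(E(a) = 0\), \(E(n) = 10\), \(E(b) = 110\), \(E(\$) = 111\) is our own:

```python
# Internal nodes are dicts keyed by '0'/'1'; leaves are source characters.
trie = {"0": "a", "1": {"0": "n", "1": {"0": "b", "1": "$"}}}

def codeword_table(node, prefix=""):
    """Walk the trie once; each leaf yields its codeword (the array E above)."""
    if isinstance(node, str):
        return {node: prefix}
    table = {}
    for bit, child in node.items():
        table.update(codeword_table(child, prefix + bit))
    return table

def encode(s: str) -> str:
    E = codeword_table(trie)
    return "".join(E[c] for c in s)

def decode(c: str) -> str:
    out, node = [], trie
    for bit in c:
        node = node[bit]            # missing child raises KeyError: invalid encoding
        if isinstance(node, str):   # reached a leaf: emit character, restart at root
            out.append(node)
            node = trie
    if node is not trie:
        raise ValueError("invalid encoding")  # stream ended mid-codeword
    return "".join(out)

assert decode(encode("banana$")) == "banana$"
```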
Huffman’s Encoding Trie¶
Huffman's Algorithm: Building the Best Trie¶
Objective: Determine the "best" trie for a given source text \(S\).
- Idea: Frequent characters should have short codewords. Infrequent characters should be "far down" in the trie.
Greedy Algorithm
- Initialization: Start with one height-0 trie (a single leaf) for each character \(c \in \Sigma_S\).
- Weight: Each trie has a weight equal to the sum of frequencies of its characters (pre-computed).
- Merge Tries: Find the two tries with the minimum weights and merge them.
- This adds one bit to the codeword of every character in the two merged tries.
- Repeat: Continue merging until only one trie remains.
Efficiency:
- Use a min-ordered heap for storage.
- Each merge requires two delete-min operations and one insert.
Huffman's Algorithm: Pseudocode¶
**Huffman::encoding(S, C)**
S: input-stream with characters in Σ_S
C: output-stream
f ← array indexed by Σ_S, initially all-0 // frequencies
while S is non-empty do increase f[S.pop()] by 1
Q ← min-oriented priority queue that stores tries // initialize PQ
for all c ∈ Σ_S with f[c] > 0 do Q.insert(c, f[c])
while Q.size() > 1 do // build decoding trie
    (T1, f1) ← Q.delete-min()
    (T2, f2) ← Q.delete-min()
    Q.insert(trie with T1, T2 as subtries, f1 + f2)
T ← Q.delete-min()
C.append(encoding trie T) // send trie
reset input-stream S
prefixFree::encoding(S, C, T) // actual encoding
Note: This assumes \(S\) has at least two distinct characters.
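The same construction in runnable form, using Python's heapq as the min-oriented priority queue (a sketch; ties are broken by insertion order, so the resulting trie is just one of several optimal ones):

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(s: str) -> dict:
    """Optimal prefix-free code for s (assumes >= 2 distinct characters)."""
    tick = count()  # unique tie-breaker so the heap never compares tries
    # Heap entries: (weight, tie, trie); a trie is a char or a [left, right] pair.
    Q = [(f, next(tick), c) for c, f in Counter(s).items()]
    heapq.heapify(Q)
    while len(Q) > 1:
        f1, _, T1 = heapq.heappop(Q)   # two tries of minimum weight...
        f2, _, T2 = heapq.heappop(Q)
        heapq.heappush(Q, (f1 + f2, next(tick), [T1, T2]))  # ...merged
    codes = {}
    def walk(node, prefix):            # read the codewords off the trie
        if isinstance(node, str):
            codes[node] = prefix
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(Q[0][2], "")
    return codes

codes = huffman_code("LOSSLESS")          # frequencies: S:4, L:2, O:1, E:1
assert len(codes["S"]) < len(codes["O"])  # frequent chars get shorter codewords
```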
Huffman Coding Discussion¶
Optimality: The constructed trie is optimal, producing the shortest \(C\) among prefix-free single-character encodings with \(\Sigma_C = \{0, 1\}\).
Limitations
- Constructed trie is not unique unless tie-breaking rules are provided.
- Decoder does not know the trie:
  - Must either send the decoding trie (requires converting the trie to a bitstring),
  - or send the character frequencies plus tie-breaking rules.
  - Both methods increase the length of the encoded text.
- Encoding requires two passes:
- First to compute frequencies.
- Second to encode.
- Cannot use a stream unless it supports resetting.
Run-time
- Encoding: \(O(|\Sigma_S| \log |\Sigma_S| + |C|)\)
- Decoding: \(O(|C|)\)
There are many variations, such as handling unused characters, frequency estimation, or adaptive encoding adjustments.
Run-Length Encoding¶
Run-Length Encoding Idea¶
- Definition: A form of multi-character encoding where multiple source-text characters are represented by a single code-word.
- Requirements:
- Input must be a bitstring \((\Sigma_S = \{0, 1\})\).
- The decoding dictionary is implicit and uniquely defined.
- Use Case: Effective when the input \(S\) has long runs of the same character, e.g., \(00000 \; 111 \; 0000000\).
- Encoding Process:
- Send the leftmost bit of \(S\) (either 0 or 1).
- Follow with a sequence of integers representing run lengths.
- Example for \(00000 \; 111 \; 0000000\): \(0, 5, 3, 7\) (no need to specify bits as runs alternate).
Detour: Encoding Positive Integers¶
Encoding Techniques:
- Base-2 Encoding: \(E_{b2}(k)\) is the string \(w\) such that \((w)_2 = k\).
- Bijective Base-2 Encoding: \(E_{bb2}(k) := E_{b2}(k+1)\) with the leading 1 removed.
  (Bijective base-26 encoding is used to enumerate spreadsheet columns.)

For encoding a sequence of integers, a separator-symbol (e.g., #) is used:
- Example 1: \(5, 3, 7 \to 101\#11\#111\)
  Encoding length: \(|C| \log |\Sigma_C| = 10 \log 3 \approx 15.8\)
- Example 2 (bijective, writing \(A = 0\), \(B = 1\)): \(5, 3, 7 \to BA\#AA\#AAA\)
  Encoding length: \(|C| \log |\Sigma_C| = 9 \log 3 \approx 14.26\)
Run-Length Encoding Details¶
Elias Gamma Code (for an integer \(k \geq 1\)):
- To encode \(k\):
  - Take the binary representation \(E_{b2}(k)\) of \(k\).
  - Prepend \(\lfloor \log k \rfloor\) zeros, so that the initial \(1\) sits exactly in the middle.
  - Example: \(E_\gamma(5) = 00\,101\).

Definition: To perform run-length encoding:
- Write the initial bit to the output.
- Determine the run lengths \(k_1, k_2, \ldots, k_d\).
- Write the concatenation \(E_{\gamma}(k_1)\, E_{\gamma}(k_2) \cdots E_{\gamma}(k_d)\) to the output.

Note: Steps 2 and 3 can be done simultaneously.
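The three integer codes of this detour in runnable form (function names are ours):

```python
def E_b2(k: int) -> str:
    """Base-2: the binary representation of k (k >= 1)."""
    return bin(k)[2:]

def E_bb2(k: int) -> str:
    """Bijective base-2: E_b2(k + 1) with the leading 1 removed."""
    return bin(k + 1)[3:]

def E_gamma(k: int) -> str:
    """Elias gamma: floor(log2 k) zeros, then E_b2(k)."""
    b = bin(k)[2:]
    return "0" * (len(b) - 1) + b   # len(b) - 1 == floor(log2 k)

assert E_b2(5) == "101"
assert E_bb2(5) == "10"        # written as BA above, with A = 0, B = 1
assert E_gamma(5) == "00101"
```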
RLE Encoding¶
**RLE::encoding(S, C)**
S: input-stream of bits
C: output-stream
b ← S.top(); C.append(b)
while S is non-empty do
    k ← 1; S.pop() // k: length of the current run
    while S is non-empty and S.top() = b do
        k++; S.pop()
    w ← empty string // w: binary encoding of k
    while k > 1
        C.append(0) // write the leading zeroes
        w.prepend(k mod 2)
        k ← ⌊k/2⌋
    w.prepend(1)
    append each bit of w to C
    b ← 1 − b
Claim: A sequence of Elias gamma codes can be decoded uniquely.
- Each Elias gamma code has the form \(0^\ell 1 \ast^\ell\) (where ‘∗’ means ‘some bit’).
- Can determine \(\ell\) by scanning until we encounter a 1.
- Convert this 1 plus the next \(\ell\) bits into the integer.
**RLE::decoding(C, S)**
C: input-stream of bits
S: output-stream
b ← C.pop() // b: bit of the current run
while C is non-empty
    ℓ ← 0 // ℓ: number of leading zeroes
    while C.pop() = 0 do ℓ++
    k ← 1 // k: integer that was encoded
    for j ← 1 to ℓ do k ← k * 2 + C.pop()
    for j ← 1 to k do S.append(b)
    b ← 1 − b
If C.pop() is called when there are no bits left, then C was not valid input.
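Both routines in runnable form (a sketch following the pseudocode; a Python iterator stands in for the bit-streams):

```python
import itertools

def rle_encode(s: str) -> str:
    """Run-length encode a bitstring: initial bit, then Elias gamma codes."""
    out = [s[0]]
    for bit, run in itertools.groupby(s):   # runs alternate, so bits are implicit
        b = bin(len(list(run)))[2:]         # E_b2 of the run length
        out.append("0" * (len(b) - 1) + b)  # Elias gamma code
    return "".join(out)

def rle_decode(c: str) -> str:
    bits = iter(c)
    b = next(bits)                 # bit of the first run
    out = []
    for first in bits:
        ell = 0
        while first == "0":        # count leading zeroes
            ell += 1
            first = next(bits)     # StopIteration here means invalid input
        k = 1
        for _ in range(ell):       # read the remaining ell bits of the length
            k = 2 * k + int(next(bits))
        out.append(b * k)
        b = "1" if b == "0" else "0"
    return "".join(out)

s = "00000" "111" "0000000"
assert rle_encode(s) == "0" + "00101" + "011" + "00111"
assert rle_decode(rle_encode(s)) == s
```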
RLE Discussion¶
- An all-0 string of length \(n\) would be compressed to \(0E_\gamma(n)\), which has \(2 \lfloor \log n \rfloor + 2 \in o(n)\) bits.
- Usually, we are not that lucky:
  - A run of length \(k\) costs \(2\lfloor \log k \rfloor + 1\) bits, so there is no compression until run-length \(k \geq 6\).
  - Expansion when run-length \(k = 2\) or \(4\).
- RLE was popular for fax machines.
- RLE can be adapted to larger alphabet sizes, but then the encoding of each run must also store the character. Used in some image formats (e.g., TIFF).
- The method can be adapted to encode only runs of 0s. Used inside bzip2 (see later).
Lempel-Ziv-Welch¶
Longer Patterns in Input¶
Huffman and RLE take advantage of frequent/repeated single characters.
Observation:
Certain substrings are much more frequent than others.
- English text:
- Most frequent digraphs: TH, ER, ON, AN, RE, HE, IN, ED, ND, HA
- Most frequent trigraphs: THE, AND, THA, ENT, ION, TIO, FOR, NDE
- HTML: <a href=, <img src=
- Video: repeated background between frames, shifted sub-image
Ingredient 1 for Lempel-Ziv-Welch compression: take advantage of such substrings without needing to know beforehand what they are.
Adaptive Dictionaries¶
- Fixed Dictionaries:
- ASCII, UTF-8, and RLE use fixed dictionaries.
- Static Dictionary:
- Huffman uses a static dictionary, which remains the same for the entire encoding/decoding process.
- Ingredient 2 for LZW: Adaptive Encoding
- Starts with a fixed initial dictionary \(D_0\) (usually ASCII).
- For \(i \geq 0\), \(D_i\) is used to determine the \(i\)-th output character.
- After writing the \(i\)-th character, the encoder updates \(D_i\) to \(D_{i+1}\).
- Challenge:
- Decoder must be able to reconstruct how the encoder updated the dictionary from the coded text.
LZW Overview¶
- Objective: Convert \(S\) into a list of code-numbers.
- Initial Setup:
- Start with dictionary \(D_0\) storing ASCII characters (code-numbers \(0, \ldots, 127\)).
- Each step adds a multi-character string to the dictionary with code-numbers \(128, 129, \ldots\).
- Encoding Process (repeated until \(S\) is exhausted; see the sketch below):
  - Find the longest string \(w\) in \(D_i\) that is a prefix of the remaining input, and output its code-number.
  - Add a useful substring to \(D_i\): insert \(wc\), where \(c\) is the character following \(w\) in \(S\).
- Dictionary Data Structure:
  - Use a trie to match characters efficiently.
  - To find \(w\): walk down the trie until reaching a node \(z\) where no child matches the next character \(c\).
  - To update \(D_i\): add a child to \(z\) with the link labelled \(c\).
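A runnable sketch of the encoder (a plain dict of strings stands in for the trie; fine for illustration, though the trie avoids rescanning \(w\)):

```python
def lzw_encode(s: str) -> list:
    """LZW: return code-numbers; dictionary starts as ASCII (codes 0..127)."""
    D = {chr(i): i for i in range(128)}
    idx = 128
    out, i = [], 0
    while i < len(s):
        w = s[i]                                  # longest prefix of s[i:] in D
        while i + len(w) < len(s) and w + s[i + len(w)] in D:
            w += s[i + len(w)]
        out.append(D[w])
        if i + len(w) < len(s):                   # add w + next character to D
            D[w + s[i + len(w)]] = idx
            idx += 1
        i += len(w)
    return out

assert lzw_encode("YO!YO!YO!") == [89, 79, 33, 128, 130, 128]
```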
LZW decoding pseudocode¶
**LZW::decoding(C, S)**
C: input-stream of integers
S: output-stream
D ← dictionary that maps {0,...,127} to ASCII
idx ← 128
k ← C.pop(); s ← D.search(k); S.append(s)
while there are more codes in C do
    s_prev ← s; k ← C.pop()
    if k < idx then s ← D.search(k)
    else if k = idx then s ← s_prev ∘ s_prev[0] // special situation
    else return FAIL // invalid encoding!
    append each character of s to S
    D.insert(idx, s_prev ∘ s[0])
    idx++
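The same decoder in runnable form, including the special situation (a sketch):

```python
def lzw_decode(codes: list) -> str:
    D = {i: chr(i) for i in range(128)}
    idx = 128
    s = D[codes[0]]
    out = [s]
    for k in codes[1:]:
        s_prev = s
        if k < idx:
            s = D[k]
        elif k == idx:                 # special situation: code not yet in D
            s = s_prev + s_prev[0]
        else:
            raise ValueError("invalid encoding")
        out.append(s)
        D[idx] = s_prev + s[0]         # mirror the encoder's dictionary update
        idx += 1
    return "".join(out)

assert lzw_decode([89, 79, 33, 128, 130, 128]) == "YO!YO!YO!"
```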
Lempel-Ziv-Welch decoding details¶
- Dictionary \(D\) maps consecutive integers to words. Use an array!
- Run-time: \(O(|S|)\)
- \(\Theta(|s|)\) each round to append \(s\) to output.
- Everything else takes no longer.
- Dictionary wastes space: Words may get long
- To save space, store string as code of prefix + one character.
- Can still look up \(s\) in \(O(|s|)\) time.
- So run-time remains \(O(|S|)\).
**LZW::dictionary-lookup(D, k)**
D: array; D[k] is an ASCII character for k ≤ 127, and a pair (prefix-code, character) otherwise
s ← empty word
while k > 127 do (k, c) ← D[k]; s.prepend(c)
s.prepend(D[k])
return s
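The compact lookup in runnable form (here D is a Python list holding the entries for codes 128, 129, ...; the example entries, which spell AND, are ours):

```python
def lookup(D: list, k: int) -> str:
    """D[k - 128] = (prefix code, last character); codes < 128 are ASCII."""
    s = []
    while k > 127:
        k, c = D[k - 128]
        s.append(c)                  # collect characters back to front
    s.append(chr(k))
    return "".join(reversed(s))

D = [(65, "N"), (128, "D")]          # code 128 = "AN", code 129 = "AND"
assert lookup(D, 129) == "AND"
```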
Lempel-Ziv-Welch discussion¶
- Encoding and decoding take \(O(|S|)\) time (assuming constant alphabet).
- Efficiency:
- Encoding and decoding need to go through the string only once.
- This allows compression to be performed while streaming the text.
- Conversion to bitstring:
- So far, encoding used numbers. Convert to bitstring using fixed-length encoding.
- Example: Use length-12 encoding, e.g., \(129 = 000010000001\).
- Stop adding to dictionary \(D\) once reaching code-number \(2^{12} - 1 = 4095\).
- Compression efficiency:
- Achieves approximately 45% compression on English text.
Combining Compression Schemes: bzip2¶
bzip2 overview¶
Idea: Combine multiple compression schemes!
The key ingredient is to use text transforms: change input into a different text that compresses better.
Steps
- Input Text: \(T_0\)
- Starting point.
- Burrows-Wheeler Transform: \(T_0 \to T_1\)
- \(T_1\) is a permutation of \(T_0\).
- If \(T_0\) has repeated substrings, then \(T_1\) will have long runs of characters.
- Move-to-Front Transform: \(T_1 \to T_2\)
- \(T_2\) uses ASCII-numbers.
- If \(T_1\) has long runs of characters, then \(T_2\) will have:
- Long runs of zeros.
- Skewed frequencies (some numbers appear more often than others).
- Modified Run-Length Encoding (RLE): \(T_2 \to T_3\)
- If \(T_2\) has long runs of zeros, then \(T_3\) is shorter.
- Skewed frequencies remain.
- Huffman Encoding: \(T_3 \to T_4\)
- Compresses well because the frequencies are skewed.
- Final compressed output.
Each step in the transformation makes the text more suitable for compression, producing smaller outputs.
bzip2 — The Easy Steps¶
Move-to-Front Transform¶
- The dictionary \(D : \{0, \dots, 127\} \to \text{ASCII}\) is stored as an unordered array and maintained with the move-to-front heuristic.
- Character \(c\) gets mapped to index \(i\) with \(D[i] = c\).
- A character in \(S\) repeats \(k\) times \(\iff C\) has a run of \(k-1\) zeroes.
- We would therefore expect lots of small numbers in the output (see the sketch below).
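A runnable sketch of the transform and its inverse:

```python
def mtf_encode(s: str) -> list:
    """Replace each character by its current index, then move it to the front."""
    D = [chr(i) for i in range(128)]
    out = []
    for c in s:
        i = D.index(c)
        out.append(i)
        D.insert(0, D.pop(i))        # move-to-front
    return out

def mtf_decode(codes: list) -> str:
    D = [chr(i) for i in range(128)]
    out = []
    for i in codes:
        out.append(D[i])
        D.insert(0, D.pop(i))        # same update keeps the dictionaries in sync
    return "".join(out)

assert mtf_encode("aaaa")[1:] == [0, 0, 0]   # a repeat yields a run of zeroes
assert mtf_decode(mtf_encode("banana")) == "banana"
```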
Modified Run-Length Encoding¶
- Input is a list of "characters" in \(\{0, \dots, 127\}\).
- Encode only the runs of 0s (other numbers unchanged).
- Encode run-length \(k\) with bijective base-2 encoding (shorter than the gamma code), using two new characters \(A, B\); since \(A\) and \(B\) occur nowhere else, the run's encoding is self-delimiting.
Huffman Encoding¶
- Exactly as seen before.
Burrows-Wheeler Transform¶
Burrows-Wheeler Transform (BWT)¶
Given
- Source text \(S\) as an array with an end-sentinel (e.g., '$').
Steps
- Write Down Cyclic Shifts:
  - Generate all cyclic shifts of the string \(S\); the \(i\)-th cyclic shift moves the first \(i\) characters to the back.
- Sort Lexicographically:
  - Arrange all cyclic shifts in lexicographic order using MSD radix sort.
  - \(\Theta(n)\) strings of length \(\Theta(n) \Rightarrow \Theta(n^2)\) worst-case time.
- Extract Rightmost Column:
  - Take the last character of each sorted cyclic shift to form the BWT result.
Properties
- The Burrows-Wheeler Transform (BWT) consists of the last characters of the cyclic shifts of \(S\) after lexicographical sorting.
- Observations:
- The result of the BWT is a permutation of \(S\).
- Repeated substrings in \(S\) cause runs of repeated characters in the BWT result, which is helpful for compression.
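The quadratic construction is only a few lines in Python (a sketch; sorted() is a comparison sort rather than MSD radix sort, and we rely on '$' comparing smaller than all letters, which holds in ASCII):

```python
def bwt_naive(s: str) -> str:
    """s must end with the sentinel '$'."""
    shifts = sorted(s[i:] + s[:i] for i in range(len(s)))  # sorted cyclic shifts
    return "".join(shift[-1] for shift in shifts)          # rightmost column

assert bwt_naive("banana$") == "annb$aa"
```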
Fast Burrows-Wheeler Encoding¶
Input:
- \(S\) with an end-sentinel ($).
Steps:
- Compute all cyclic shifts of \(S\).
- Sort cyclic shifts lexicographically to form the sorted order.
- Extract the last column (rightmost character of each sorted cyclic shift).
Observations:
- Because of the end-sentinel, sorting the cyclic shifts yields the same permutation as sorting the suffixes of \(S\).
- We can therefore use the suffix array \(A^S\), which can be computed in \(O(n \log n)\) time.
BWT Encoding:
The BWT is read off directly from the suffix array: \(C[i] = S[A^S[i] - 1]\), with the index taken cyclically (when \(A^S[i] = 0\), \(C[i]\) is the last character of \(S\), the sentinel \$).
Notes:
- Suffix Array (\(A^S\)): stores the starting indices of the suffixes of \(S\) in sorted order.
- This faster approach avoids explicitly constructing all cyclic shifts, saving time and space.
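The read-off in code (a sketch; the suffix array is built naively here, not in \(O(n \log n)\)):

```python
def bwt_from_suffix_array(s: str) -> str:
    """C[i] = S[A[i] - 1], taken cyclically: A[i] = 0 wraps to the sentinel."""
    A = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array
    return "".join(s[i - 1] for i in A)             # Python's s[-1] does the wrap

assert bwt_from_suffix_array("banana$") == "annb$aa"
```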
BWT Decoding Code¶
**BWT::decoding(C, S)**
C: string of characters, one of which (not necessarily the last one) is $
S: output-stream
Initialize array A // leftmost column
for all indices i of C
    A[i] ← (C[i], i) // store character and index
stably sort A by first component
for all indices j of C // where is the $-char?
    if C[j] = $ then break
do
    S.append(character stored in A[j]) // extract source text
    j ← index stored in A[j]
while appended character is not $
BWT and bzip2 Discussion¶
BWT Costs:
- BWT Encoding Cost: \(O(n \log n)\)
  - Dominated by building the suffix array; reading off the encoding then takes \(O(n)\).
- BWT Decoding Cost: \(O(n + |\Sigma_S|)\)
  - Faster than encoding (a single stable counting sort plus one pass).
- Both encoding and decoding use \(O(n)\) space.
Characteristics:
- Requires all of the text (no streaming is possible).
- BWT (and hence bzip2) is a block compression method, compressing one block at a time.
bzip2 Costs:
- bzip2 Encoding Cost: \(O(n (\log n + |\Sigma|))\) with a large constant.
- bzip2 Decoding Cost: \(O(n |\Sigma|)\) with a large constant.
General Observations:
- bzip2 is slower than many other compression methods.
- However, it achieves better compression due to the transformations (like BWT).