
Module 10: Data Compression

Background

Data Compression Introduction

The problem: How to store and transmit data efficiently?

  • Source text: The original data, string \(S\) of characters from the source alphabet \(\Sigma_S\)
  • Coded text: The encoded data, string \(C\) of characters from the code alphabet \(\Sigma_C\)
  • Encoding: An algorithm mapping source texts to coded texts
  • Decoding: An algorithm mapping coded texts back to their original source text
\[ S \xrightarrow{\text{encode}} C \xrightarrow{\text{transmit}} C \xrightarrow{\text{decode}} S \]
  • Source "text" can be any sort of data (not always text!)
  • The code alphabet \(\Sigma_C\) is usually \(\{0, 1\}\).
  • We consider here only lossless compression: Can recover \(S\) from \(C\) without error.

Judging Encoding Schemes

Main objective: Minimize the size of the coded text by measuring the compression ratio:

\[ \frac{|C| \cdot \log |\Sigma_C|}{|S| \cdot \log |\Sigma_S|} \]

Examples

  • \(73 = (73)_{10} \to (1001001)_{2}\)

    Compression ratio: \(\frac{7 \cdot \log 2}{2 \cdot \log 10} \approx 1.05\)

  • \(127 = (127)_{10} \to (7F)_{16}\)

    Compression ratio: \(\frac{2 \cdot \log 16}{3 \cdot \log 10} \approx 0.8\)
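
For concreteness, here is a minimal Python sketch (not from the original notes; names are illustrative) that reproduces both ratio computations:

```python
# Compression ratio |C|·log|Σ_C| / (|S|·log|Σ_S|); the base of the
# logarithm cancels, so log2 is used throughout.
from math import log2

def compression_ratio(len_C, sigma_C, len_S, sigma_S):
    return (len_C * log2(sigma_C)) / (len_S * log2(sigma_S))

print(compression_ratio(7, 2, 2, 10))   # 73 -> 1001001: ~1.05 (expansion)
print(compression_ratio(2, 16, 3, 10))  # 127 -> 7F:     ~0.80 (compression)
```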

Efficiency

  • Encoding/decoding requires time \(\Omega(|S| + |C|)\), sometimes more.

Other goals (not covered here)

  • Reliability (e.g., error-correcting codes)
  • Security (e.g., encryption)

Impossibility of Compressing

Observation: No lossless encoding scheme can achieve a compression ratio less than 1 for all input strings.

Proof (for \(\Sigma_S = \Sigma_C = \{0, 1\}\)):

  • Fix a size \(n\).
  • Assume for contradiction that all length-\(n\) bitstrings are compressed to a shorter length (at most \(n-1\)).
  • This leads to a mismatch in set sizes:
    • Bitstrings of length \(n\): \(2^n\) elements.
    • Bitstrings of length \(\leq n-1\): \(2^{n-1} + 2^{n-2} + \dots + 2^0 = 2^n - 1 < 2^n\) elements.

By the pigeonhole principle, two distinct length-\(n\) inputs must then map to the same coded text, so decoding cannot recover both; the assumption is false.

Conclusion:

  • Worst-case compression bounds are inherently limited.
  • However, real-life data often contains patterns, allowing effective compression.

Detour: Streams (Review)

Large texts often exceed memory capacity, so we access \(S\) and \(C\) through input and output streams.


  • Input-stream (\(\sim\) std::cin): Reads one character at a time.
    • Uses a stack-like interface: top / pop / is-empty.
  • Output-stream (\(\sim\) std::cout): Writes one character at a time.
    • Uses a queue-like interface: append.

Advantages

  • Enables processing the text as it loads, without holding all of it in memory.

Note: Some algorithms require resetting the input-stream (re-reading it from the start).

Single-Character Encodings

Character Encodings

A character encoding (or single-character encoding) maps each character in the source alphabet to a string in the code alphabet:

\[ E : \Sigma_S \to \Sigma_C^* \]

ASCII uses a (length-7) fixed-length code where each codeword \(E(c)\) has the same length.

Encoding simply concatenates the codewords; decoding repeatedly reads the next 7 bits and looks up the character.

Example: APPLE \(\leftrightarrow (65, 80, 80, 76, 69) \leftrightarrow 1000001\ 1010000\ 1010000\ 1001100\ 1000101\)

Earlier fixed-length codes include Caesar shift, Baudot code, and Murray code. Fixed-length codes are simple but do not compress data.
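
As an illustration, here is a small Python sketch (helper names are assumptions, not part of the notes) of fixed-length encoding and decoding with 7-bit codewords:

```python
# 7-bit fixed-length encoding: every codeword E(c) has length exactly 7.
def encode_ascii7(s):
    return "".join(format(ord(c), "07b") for c in s)

def decode_ascii7(bits):
    # decode by repeatedly consuming the next 7 bits
    return "".join(chr(int(bits[i:i + 7], 2)) for i in range(0, len(bits), 7))

bits = encode_ascii7("APPLE")
assert bits == "10000011010000101000010011001000101"
assert decode_ascii7(bits) == "APPLE"
```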

Variable-Length Codes

Better idea: Use variable-length codes.

  • Motivation: Some letters in \(\Sigma_S\) occur more frequently than others. Example: Letter frequencies in English text: \(e: 12.70\%\), \(t: 9.06\%\), \(a: 8.17\%\), \(o: 7.51\%\), \(i: 6.97\%\), \(n: 6.75\%\), etc.

Idea: Assign shorter codes to more frequent characters.

Similar to fixed-length codes, map the source alphabet to codewords \(E : \Sigma_S \to \Sigma_C^*\), but:

  • Codewords do not need to be of the same length.
  • This reduces the overall size of the encoded text.

Encoding

Assume a character encoding \(E : \Sigma_S \to \Sigma_C^*\), where \(E\) is a dictionary with keys in \(\Sigma_S\).

**singleChar::encoding(E, S, C)**
E: the encoding dictionary
S: input-stream with characters in Σ_S
C: output-stream of bits

while S is non-empty
        w ← E.search(S.pop())
        append each bit of w to C

Decoding

The decoding algorithm must map \(\Sigma_C^*\) to \(\Sigma_S^*\).

  • The code must be lossless, meaning it is uniquely decodable.
    • (Morse code uses pauses as end-sentinels to avoid ambiguity.)

From now on, we only consider prefix-free codes \(E\):

  • No codeword is a prefix of another.
  • This corresponds to a trie where characters of \(\Sigma_S\) only appear at the leaves.

Prefix-free codes eliminate the need for an end-sentinel \(\$\).

Decoding of Prefix-Free Codes

Any prefix-free code is uniquely decodable.

**prefixFree::decoding(C, S, T)**
C: input-stream with characters in Σ_C
S: output-stream
T: encoding-trie

while C is non-empty
        z ← T.root
        while z is not a leaf
                if C is empty or z has no child labelled C.top()
                        return "invalid encoding"
                z ← child of z that is labelled with C.pop()
        S.append(character stored at z)

Run-time: \(O(|C|)\)
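
A minimal Python sketch of this decoding loop, with the encoding trie represented as nested dictionaries (an illustrative representation, not prescribed by the notes):

```python
def decode_prefix_free(bits, trie):
    # trie: nested dicts keyed by '0'/'1'; a leaf is a single character
    out, i = [], 0
    while i < len(bits):
        z = trie                            # start at the root
        while not isinstance(z, str):       # descend until a leaf is reached
            if i == len(bits) or bits[i] not in z:
                raise ValueError("invalid encoding")
            z = z[bits[i]]
            i += 1
        out.append(z)                       # leaf: emit the stored character
    return "".join(out)

# Example trie for the prefix-free code E(a)=0, E(n)=10, E(b)=11:
T = {"0": "a", "1": {"0": "n", "1": "b"}}
assert decode_prefix_free("010110", T) == "anba"
```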

Encoding from Trie

**prefixFree::encoding(S, C, T)**
S: input-stream with characters in Σ_S
C: output-stream
T: encoding trie

E ← array of trie-nodes indexed by Σ_S
for all leaves ℓ in T do E[character at ℓ] ← ℓ
while S is non-empty
        w ← empty bitstring
        v ← E[S.pop()]
        while v is not the root
                w.prepend(character on edge from parent(v) to v)
                v ← parent(v)
        append each bit of w to C
  • Run-time: \(O(|T| + |C|)\)
  • Assumes all interior nodes of \(T\) have two children; otherwise, the encoding scheme can be improved.
  • Therefore, \(|T| \leq 2|\Sigma_S| - 1\), leading to a run-time of \(O(|\Sigma_S| + |C|)\).

Huffman’s Encoding Trie

Huffman's Algorithm: Building the Best Trie

Objective: Determine the "best" trie for a given source text \(S\).

  • Idea: Frequent characters should have short codewords. Infrequent characters should be "far down" in the trie.

Greedy Algorithm

  1. Initialization: Start with one height-0 trie per character \(c \in \Sigma_S\), holding just \(c\).
  2. Weight: Each trie has a weight equal to the sum of frequencies of its characters (pre-computed).
  3. Merge Tries: Find the two tries with the minimum weights and merge them.
    • This corresponds to adding one bit to the encoding of each character.
  4. Repeat: Continue merging until only one trie remains.

Efficiency:

  • Store the tries in a min-oriented priority queue (heap), keyed by weight.
  • Step 3 requires two delete-min operations and one insert.

Huffman's Algorithm: Pseudocode

**Huffman::encoding(S, C)**
S: input-stream with characters in Σ_S
C: output-stream

f ← array indexed by Σ_S, initially all-0 // frequencies
while S is non-empty do increase f[S.pop()] by 1
Q ← min-oriented priority queue that stores tries // initialize PQ
for all c ∈ Σ_S with f[c] > 0 do Q.insert(c, f[c])
while Q.size() > 1 do // build decoding trie
        (T1, f1) ← Q.delete-min()
        (T2, f2) ← Q.delete-min()
        Q.insert(trie with T1, T2 as subtries, f1+f2)
T ← Q.delete-min()
C.append(encoding trie T) // send trie
Re-set input-stream S
prefixFree::encoding(S, C, T) // actual encoding

Note: This assumes \(S\) has at least two distinct characters.
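
A sketch of the same construction in Python, with heapq playing the role of the min-oriented priority queue. The unique counter in each heap entry acts as a tie-breaker, so the exact codewords may differ from any particular example (only the codeword lengths are guaranteed optimal):

```python
import heapq
from collections import Counter

def huffman_codes(S):
    freq = Counter(S)                         # first pass: frequencies
    # heap entries: (weight, tiebreaker, trie); a trie is a leaf character
    # or a pair (left subtrie, right subtrie)
    heap = [(f, i, c) for i, (c, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:                      # merge two minimum-weight tries
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(t, prefix=""):
        if isinstance(t, str):
            codes[t] = prefix or "0"          # degenerate one-character case
        else:
            walk(t[0], prefix + "0")          # left child: bit 0
            walk(t[1], prefix + "1")          # right child: bit 1
    walk(heap[0][2])
    return codes

print(huffman_codes("GREENENERGY"))  # 'E' (most frequent) gets a shortest code
```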

Huffman Coding Discussion

Optimality: The constructed trie is optimal, producing the shortest \(C\) among prefix-free single-character encodings with \(\Sigma_C = \{0, 1\}\).

Limitations

  • Constructed trie is not unique unless tie-breaking rules are provided.
  • Decoder does not know the trie:
    • Must either send the decoding trie (requires converting the trie into a bitstring),
    • or send the character frequencies and tie-breaking rules.
    • Either way, the encoded text gets longer.
  • Encoding requires two passes:
    • First to compute frequencies.
    • Second to encode.
    • Cannot use a stream unless it supports re-setting.

Run-time

  • Encoding: \(O(|\Sigma_S| \log |\Sigma_S| + |C|)\)
  • Decoding: \(O(|C|)\)

There are many variations, such as handling unused characters, frequency estimation, or adaptive encoding adjustments.

Run-Length Encoding

Run-Length Encoding Idea

  • Definition: A form of multi-character encoding where multiple source-text characters are represented by a single code-word.
  • Requirements:
    • Input must be a bitstring \((\Sigma_S = \{0, 1\})\).
    • The decoding dictionary is implicit and uniquely defined.
  • Use Case: Effective when the input \(S\) has long runs of the same character, e.g., \(00000 \; 111 \; 0000000\).
  • Encoding Process:
    1. Send the leftmost bit of \(S\) (either 0 or 1).
    2. Follow with a sequence of integers representing run lengths.
    3. Example for \(00000 \; 111 \; 0000000\): \(0, 5, 3, 7\) (no need to specify bits as runs alternate).

Detour: Encoding Positive Integers

Encoding Techniques:

  • Base-2 Encoding:

    \(E_{b2}(k)\) is the string \(w\) such that \((w)_2 = k\).

  • Bijective Base-2 Encoding:

    \(E_{bb2}(k) := E_{b2}(k+1)\) with the leading 1 removed.

    (Bijective base-26 encoding is used to enumerate spreadsheet columns.)

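    A one-line Python sketch matching this definition:

```python
# Bijective base-2: binary representation of k+1 with the leading 1 removed.
def bb2(k):
    return format(k + 1, "b")[1:]

print([bb2(k) for k in range(1, 8)])
# ['0', '1', '00', '01', '10', '11', '000']
```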

For encoding a sequence of integers, a separator-symbol (e.g., #) is used:

  • Example 1:

    \(5, 3, 7 \to 101\#11\#111\)

    Encoding length: \(|C| \log |\Sigma_C| = 10 \log(3) \approx 15.8\)

  • Example 2:

    \(5, 3, 7 \to BA\#AA\#AAA\)

    Encoding length: \(|C| \log |\Sigma_C| = 9 \log(3) \approx 14.26\)

Run-Length Encoding Details

Elias Gamma Code (for an integer \(k\)):

  • To encode \(k\):

    1. Take the binary representation \(E_{b2}(k)\) of \(k\).
    2. Add leading zeros so that the initial \(1\) is in the middle.
    3. This means adding \(\lfloor \log k \rfloor\) leading zeros.


Definition: To perform run-length encoding:

  1. Write the initial bit to output.
  2. Determine the run lengths \(k_1, k_2, \ldots, k_d\).
  3. Write the concatenation \(E_{\gamma}(k_1)\,E_{\gamma}(k_2) \cdots E_{\gamma}(k_d)\) to output.

Note:

Steps 2-3 can be done simultaneously.
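
A minimal Python sketch of the Elias gamma code as just defined:

```python
# Elias gamma: floor(log2 k) leading zeroes, then the binary representation of k.
def elias_gamma(k):
    b = format(k, "b")              # length floor(log2 k) + 1
    return "0" * (len(b) - 1) + b

print([elias_gamma(k) for k in (1, 2, 3, 5, 7)])
# ['1', '010', '011', '00101', '00111']
```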

RLE Encoding

**RLE::encoding(S, C)**
S: input-stream of bits, C: output-stream

b ← S.top(); C.append(b)
while S is non-empty do
    k ← 0  // k: length of run
    while (S is non-empty and S.top() = b) do
        k++; S.pop()
    w ← empty string  // w: binary encoding of k
    while k > 1
        C.append(0)  // write leading zeroes
        w.prepend(k mod 2)
        k ← ⌊k/2⌋
    w.prepend(1)
    append each bit of w to C
    b ← 1 - b

Claim: A sequence of Elias gamma codes can be decoded uniquely.

  • Each Elias gamma code has the form \(0^\ell 1 \ast^\ell\) (where ‘∗’ means ‘some bit’).
  • Can determine \(\ell\) by scanning until we encounter a 1.
  • Convert this 1 plus the next \(\ell\) bits into the integer.
**RLE::decoding(C, S)**
C: input-stream of bits, S: output-stream

b ← C.pop() // b: bit of the current run
while C is non-empty
   ℓ ← 0     // ℓ: number of leading zeroes
   while C.pop() = 0 do ℓ++
   k ← 1     // k: integer that was encoded
   for (j ← 1 to ℓ) do k ← k * 2 + C.pop()
   for (j ← 1 to k) do S.append(b)
   b ← 1 − b

If C.pop() is called when there are no bits left, then C was not valid input.

RLE Discussion

  • An all-0 string of length \(n\) would be compressed to \(0E_\gamma(n)\), which has \(2 \lfloor \log n \rfloor + 2 \in o(n)\) bits.
  • Usually, we are not that lucky:
    • No compression until run-length \(k \geq 6\).
    • Expansion when run-length \(k = 2\) or \(4\).
  • RLE was popular for fax machines.
  • RLE can be adapted to larger alphabet sizes (but then the encoding of each run must also store the character).

    Used in some image formats (e.g., TIFF).

  • The method can be adapted to encode only runs of 0.

    Used inside bzip2 (see later).

Lempel-Ziv-Welch

Longer Patterns in Input

Huffman and RLE take advantage of frequent/repeated single characters.

Observation:

Certain substrings are much more frequent than others.

  • English text:
    • Most frequent digraphs: TH, ER, ON, AN, RE, HE, IN, ED, ND, HA
    • Most frequent trigraphs: THE, AND, THA, ENT, ION, TIO, FOR, NDE
  • HTML: <a href=, <img src=
  • Video: repeated background between frames, shifted sub-image

Ingredient 1 for Lempel-Ziv-Welch compression: take advantage of such substrings without needing to know beforehand what they are.

Adaptive Dictionaries

  • Fixed Dictionaries:
    • ASCII, UTF-8, and RLE use fixed dictionaries.
  • Static Dictionary:
    • Huffman uses a static dictionary, which remains the same for the entire encoding/decoding process.
  • Ingredient 2 for LZW: Adaptive Encoding
    • Starts with a fixed initial dictionary \(D_0\) (usually ASCII).
    • For \(i \geq 0\), \(D_i\) is used to determine the \(i\)-th output character.
    • After writing the \(i\)-th character, the encoder updates \(D_i\) to \(D_{i+1}\).
  • Challenge:
    • Decoder must be able to reconstruct how the encoder updated the dictionary from the coded text.

LZW Overview

  • Objective: Convert \(S\) into a list of code-numbers.
  • Initial Setup:
    • Start with dictionary \(D_0\) storing ASCII characters (code-numbers \(0, \ldots, 127\)).
    • Each step adds a multi-character string to the dictionary with code-numbers \(128, 129, \ldots\).
  • Encoding Process:
    1. Find longest string \(w\) in \(D_i\):
      • \(w\) is the longest prefix of the remaining input that appears in \(D_i\), so it can be encoded with one number.
    2. Add useful substring to \(D_i\):
      • Add \(wc\) to the dictionary, where \(c\) is the next character following \(w\) in \(S\).
  • Dictionary Data Structure:
    • Use a trie to match characters efficiently.
    • To find \(w\):
      • Parse the trie until reaching a node \(z\) where no child matches the next character \(c\).
    • To update \(D_i\):
      • Add a child to \(z\) with a link labeled \(c\).
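
The decoding pseudocode follows below; for the encoding side, here is a hedged Python sketch of the process just described, using a plain dictionary of strings instead of a trie (simpler to read, same output):

```python
def lzw_encode(S):
    D = {chr(i): i for i in range(128)}    # D_0: ASCII code-numbers 0..127
    idx = 128                              # next free code-number
    out, i = [], 0
    while i < len(S):
        w = S[i]
        while i + len(w) < len(S) and w + S[i + len(w)] in D:
            w += S[i + len(w)]             # longest string w already in D
        out.append(D[w])                   # emit one code-number for w
        if i + len(w) < len(S):
            D[w + S[i + len(w)]] = idx     # add w + next character
            idx += 1
        i += len(w)
    return out

print(lzw_encode("YOYO!"))  # [89, 79, 128, 33]: "YO" was assigned code 128
```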

LZW decoding pseudocode

**LZW::decoding(C, S)**
C: input-stream of integers, S: output-stream

D ← dictionary that maps {0,...,127} to ASCII
idx ← 128
k ← C.pop(); s ← D.search(k); S.append(s)
while there are more codes in C do
    s_prev ← s; k ← C.pop()
    if k < idx then s ← D.search(k)
    else if k = idx then s ← s_prev ∘ s_prev[0]  // special situation
    else return FAIL  // invalid encoding!
    append each character of s to S
    D.insert(idx, s_prev ∘ s[0])
    idx++

Lempel-Ziv-Welch decoding details

  • Dictionary \(D\) maps consecutive integers to words. Use an array!
  • Run-time: \(O(|S|)\)
    • \(\Theta(|s|)\) each round to append \(s\) to output.
    • Everything else takes no longer.
  • Dictionary wastes space: Words may get long
    • To save space, store string as code of prefix + one character.
    • Can still look up \(s\) in \(O(|s|)\) time.
    • So run-time remains \(O(|S|)\).
**LZW::dictionary-lookup(D, k)**

s ← empty word
while k > 127 do (k, c) ← D[k], s.prepend(c)
s.prepend(D[k])

Lempel-Ziv-Welch discussion

  • Encoding and decoding take \(O(|S|)\) time (assuming constant alphabet).
  • Efficiency:
    • Encoding and decoding need to go through the string only once.
    • This allows compression to be performed while streaming the text.
  • Conversion to bitstring:
    • So far, encoding used numbers. Convert to bitstring using fixed-length encoding.
    • Example: Use length-12 encoding, e.g., \(129 = 000010000001\).
    • Stop adding to dictionary \(D\) once reaching code-number \(2^{12} - 1 = 4095\).
  • Compression efficiency:
    • Achieves a compression ratio of roughly 45% on English text.

Combining Compression Schemes: bzip2

bzip2 overview

Idea: Combine multiple compression schemes!

The key ingredient is to use text transforms: change input into a different text that compresses better.

Steps

  1. Input Text: \(T_0\)
    • Starting point.
  2. Burrows-Wheeler Transform: \(T_0 \to T_1\)
    • \(T_1\) is a permutation of \(T_0\).
    • If \(T_0\) has repeated substrings, then \(T_1\) will have long runs of characters.
  3. Move-to-Front Transform: \(T_1 \to T_2\)
    • \(T_2\) uses ASCII-numbers.
    • If \(T_1\) has long runs of characters, then \(T_2\) will have:
      • Long runs of zeros.
      • Skewed frequencies (some numbers appear more often than others).
  4. Modified Run-Length Encoding (RLE): \(T_2 \to T_3\)
    • If \(T_2\) has long runs of zeros, then \(T_3\) is shorter.
    • Skewed frequencies remain.
  5. Huffman Encoding: \(T_3 \to T_4\)
    • Compresses well because the frequencies are skewed.
    • Final compressed output.

Each step in the transformation makes the text more suitable for compression, producing smaller outputs.

bzip2 — The Easy Steps

Move-to-Front Transform

  • Dictionary \(D : \{0, \dots, 127\} \to \text{ASCII}\) is stored as an array and maintained with the move-to-front heuristic.
  • Character \(c\) is encoded as the index \(i\) with \(D[i] = c\); afterwards \(c\) is moved to the front of the array.
  • A character in \(S\) repeats \(k\) times \(\iff C\) has a run of \(k-1\) zeroes.
  • We would expect lots of small numbers in the output.
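
A small Python sketch of the transform as described (array-based; names are illustrative):

```python
def mtf_encode(S):
    D = [chr(i) for i in range(128)]       # initial dictionary in ASCII order
    out = []
    for c in S:
        i = D.index(c)                     # index of c in the current array
        out.append(i)
        D.insert(0, D.pop(i))              # move c to the front
    return out

print(mtf_encode("aaabbb"))  # [97, 0, 0, 98, 0, 0]: repeats become zeroes
```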

Modified Run-Length Encoding

  • Input is a list of "characters" in \(\{0, \dots, 127\}\).
  • Encode only the runs of 0s (other numbers unchanged).
  • Encode run-length \(k\) with bijective base-2 encoding (shorter), using two new characters \(A, B\).

Huffman Encoding

  • Exactly as seen before.

Burrows-Wheeler Transform

Burrows-Wheeler Transform (BWT)

Given

  • Source text \(S\) as an array with an end-sentinel (e.g., '$').

Steps

  1. Write Down Cyclic Shifts:
    • Generate all cyclic shifts of the string \(S\).
      • \(i\)th cyclic shift: move \(i\) characters from front to back.
  2. Sort Lexicographically:
    • Arrange all cyclic shifts in lexicographical order using MSD-radix-sort.
    • \(\Theta(n)\) strings of length \(\Theta(n) \Rightarrow\) \(\Theta(n^2)\) worst-case time
  3. Extract Rightmost Column:
    • Take the last character of each sorted cyclic shift to form the BWT result.

Properties

  • The Burrows-Wheeler Transform (BWT) consists of the last characters of the cyclic shifts of \(S\) after lexicographical sorting.
  • Observations:
    • The result of the BWT is a permutation of \(S\).
    • Repeated substrings in \(S\) cause runs of repeated characters in the BWT result, which is helpful for compression.
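
A direct, quadratic-time Python sketch of this construction (fine for illustration, far too slow for large inputs):

```python
def bwt(S):
    # S must end with the unique sentinel '$' (assumed smallest here)
    shifts = sorted(S[i:] + S[:i] for i in range(len(S)))  # sorted cyclic shifts
    return "".join(row[-1] for row in shifts)              # rightmost column

assert bwt("banana$") == "annb$aa"
```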

Fast Burrows-Wheeler Encoding

Input:

  • \(S\) with an end-sentinel ($).

Recall the steps:

  1. Compute all cyclic shifts of \(S\).
  2. Sort the cyclic shifts lexicographically.
  3. Extract the last column (the rightmost character of each sorted cyclic shift).

Observations:

  • Because the sentinel $ is unique, sorting the cyclic shifts produces the same permutation as sorting the suffixes of \(S\).
  • That permutation is the suffix array \(A^S\), which can be computed in \(O(n \log n)\) time.

BWT Encoding:

The BWT encoding is read directly using:

\[ C[i] = \begin{cases} S[A^S[i] - 1] & \text{if } A^S[i] > 0, \\ \$ & \text{otherwise.} \end{cases} \]

Notes:

  • Suffix Array (\(A^S\)):
    • Represents the starting indices of sorted suffixes in \(S\).
  • This faster approach avoids explicitly constructing all cyclic shifts, saving time and space.
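
A Python sketch of reading \(C\) off the suffix array per the formula above (the toy suffix-array construction shown is quadratic and only for illustration):

```python
def bwt_from_suffix_array(S, sa):
    # C[i] = S[sa[i]-1] if sa[i] > 0, else the sentinel '$'
    return "".join(S[j - 1] if j > 0 else "$" for j in sa)

S = "banana$"
sa = sorted(range(len(S)), key=lambda i: S[i:])  # toy construction, not O(n log n)
assert bwt_from_suffix_array(S, sa) == "annb$aa" # matches the direct method
```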

BWT Decoding Code

**BWT::decoding(C, S)**
C: string of characters, one of which (not necessarily the last) is $
S : output-stream

Initialize array A  // leftmost column
for all indices i of C
    A[i] ← (C[i], i)  // store character and index
Stably sort A by first aspect

for all indices j of C  // where is the $-char?
    if C[j] = $ break

do
    S.append(character stored in A[j])  // extract source text
    j ← index stored in A[j]
while appended character is not $

BWT and bzip2 Discussion

BWT Costs:

  • BWT Encoding Cost: \(O(n \log n)\)
    • Dominated by suffix-array construction; reading \(C\) from it afterwards takes \(O(n)\).
  • BWT Decoding Cost: \(O(n + |\Sigma_S|)\)
    • Faster than encoding.
  • Both encoding and decoding use \(O(n)\) space.

Characteristics:

  • Requires all of the text (no streaming is possible).
  • BWT (and hence bzip2) is a block compression method, compressing one block at a time.

bzip2 Costs:

  • bzip2 Encoding Cost: \(O(n (\log n + |\Sigma|))\) with a large constant.
  • bzip2 Decoding Cost: \(O(n |\Sigma|)\) with a large constant.

General Observations:

  • bzip2 is slower than many other compression methods.
  • However, it achieves better compression due to the transformations (like BWT).