
Module 8: Range-Searching in Dictionaries for Points

Range Searches

  • So far: search($k$) looks for one specific item.
  • New operation range-search: look for all items that fall within a given range.

    • Input: A range, i.e., an interval \(Q = (x, x')\). It may be open or closed at the ends.
    • Goal: Report all KVPs in the dictionary whose key \(k\) satisfies \(k \in Q\).


  • As usual, \(n\) denotes the number of input items.

  • Let \(s\) be the output size, i.e., the number of items in the range.
  • We need \(\Omega(s)\) time simply to report the items.
  • Note that sometimes \(s = 0\) and sometimes \(s = n\); we therefore keep it as a separate parameter when analyzing the runtime.

Typical runtime: \(O(\log n + s)\).

Range Searches in Existing Dictionary Realizations

  • Unsorted list/array/hash table: Range search requires \(\Omega(n)\) time:
    • We have to check each item explicitly to see if it is in the range.
  • Sorted array: Range search in \(A\) can be done in \(O(\log n + s)\) time:


    • Using binary search, find \(i\) such that \(x\) is at (or would be at) \(A[i]\).
    • Using binary search, find \(i'\) such that \(x'\) is at (or would be at) \(A[i']\).
    • Report all items \(A[i+1 \dots i' - 1]\).
    • Report \(A[i]\) and \(A[i']\) individually if they are in range. (A code sketch follows after this list.)
  • BST: Range searches can similarly be done in \(O(\text{height} + s)\) time.
    • We will see this in detail later.
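For the sorted-array case, the two binary searches and the report step can be sketched as follows. This is a minimal Python sketch; the function name and the use of the bisect module are illustrative, not from the notes.

```python
# Range search in a sorted array A: O(log n) for the two binary searches
# plus O(s) to report the s items that lie in the closed range [x1, x2].
from bisect import bisect_left, bisect_right

def range_search_sorted(A, x1, x2):
    i = bisect_left(A, x1)    # first index with A[i] >= x1
    j = bisect_right(A, x2)   # first index with A[j] > x2
    return A[i:j]             # exactly the items with x1 <= key <= x2

# Example: keys 5 and 8 fall inside the closed range [4, 9].
print(range_search_sorted([1, 3, 5, 8, 12], 4, 9))  # [5, 8]
```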

Multi-Dimensional Data

Range searches are of special interest for multi-dimensional data. Example: flights that leave between 9am and noon, and cost between $400 and $600.


  • Each item has \(d\) aspects (coordinates): \((x_0, x_1, \dots, x_{d-1})\), so it corresponds to a point in \(d\)-dimensional space.
  • We concentrate on \(d = 2\), i.e., points in the Euclidean plane.
  • (Orthogonal) \(d\)-dimensional range search: Given a query rectangle \(Q = [x_1, x_1'] \times \cdots \times [x_d, x_d']\), find all points that lie within \(Q\).

The time for range searches depends on how the points are stored.

  • Could store a 1-dimensional dictionary (where the key is some combination of the aspects).
    • Problem: Range search on one aspect is not straightforward.
  • Could use one dictionary for each aspect.
    • Problem: Inefficient, wastes space.
  • Better idea: Design new data structures specifically for points.
    • Quadtrees
    • kd-trees
    • range-trees
  • Assumption: Points are in general position:
    • No two points lie on a common horizontal line.
    • No two points lie on a common vertical line.

This simplifies the presentation; the data structures can be generalized to remove the assumption.

Quadtrees

  • We have \(n\) points in the plane, represented as \(P = \{(x_0, y_0), (x_1, y_1), \dots, (x_{n-1}, y_{n-1})\}\).
  • Define a bounding box \(R = [0, 2^k] \times [0, 2^k]\) to contain all points.
    • Assume all coordinates are non-negative.
    • Choose the smallest \(k\) such that the box covers all points.

Quadtree Structure and Construction

  • Root node represents region \(R\).
  • If \(R\) has 0 or 1 point, the root is a leaf storing that point.
  • Otherwise, split \(R\) into four quadrants: \(R_{NE}, R_{NW}, R_{SW}, R_{SE}\).
    • Partition points into sets based on quadrants.
    • Convention: Points on split lines go to the right/top side.
  • Recursively build subtrees for each quadrant, making them children of the root.

Quadtree Dictionary Operations

  • search: Similar to binary search trees and tries.
  • insert:
    • Search for the point.
    • If the leaf now contains two points, split it (repeatedly if necessary) until the two points lie in different regions.
  • delete:
    • Search for the point and remove it.
    • If the parent now has only one point left below it, replace the parent and its children by a single leaf (and repeat upward while an ancestor has only one point left below it).
**QTree::range-search(r ← root, Q)**
r: The root of a quadtree, Q: Query-rectangle

R ← region associated with node r
if (R ⊆ Q) then                              // inside node: stop searching
    report all points below r; return
else if (R ∩ Q is empty) then return         // outside node: stop searching
// otherwise r is a boundary node: recurse
if (r is a leaf) then
    p ← point stored at r
    if p is not NULL and p is in Q then report p
    return
for each child v of r do QTree::range-search(v, Q)

Note: We assume each node of the quadtree stores the associated square. Alternatively, these could be re-computed during the search (space-time tradeoff).
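The construction and the range search above can be sketched in Python as follows. This is a minimal sketch under the stated conventions (points on split lines go to the right/top side, general position, bounding box \([0, 2^k] \times [0, 2^k]\)); the names QTNode, build, and range_search are illustrative, not from the notes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Region = Tuple[float, float, float, float]           # (x_lo, y_lo, x_hi, y_hi)

@dataclass
class QTNode:
    region: Region
    point: Optional[Point] = None                     # set only at leaves
    children: List["QTNode"] = field(default_factory=list)

def build(points: List[Point], region: Region) -> QTNode:
    """Recursively build a quadtree for distinct points inside region."""
    node = QTNode(region)
    if len(points) <= 1:                              # 0 or 1 point: leaf
        node.point = points[0] if points else None
        return node
    x_lo, y_lo, x_hi, y_hi = region
    xm, ym = (x_lo + x_hi) / 2, (y_lo + y_hi) / 2     # split into four quadrants
    quads = [(x_lo, y_lo, xm, ym), (xm, y_lo, x_hi, ym),    # SW, SE
             (x_lo, ym, xm, y_hi), (xm, ym, x_hi, y_hi)]    # NW, NE
    buckets: List[List[Point]] = [[], [], [], []]
    for x, y in points:                               # right/top convention: >= goes right/up
        buckets[(2 if y >= ym else 0) + (1 if x >= xm else 0)].append((x, y))
    node.children = [build(b, q) for b, q in zip(buckets, quads)]
    return node

def range_search(node: QTNode, Q: Region) -> List[Point]:
    R = node.region
    if contains(Q, R):                                # inside node: report everything below
        return all_points(node)
    if not intersects(Q, R):                          # outside node: nothing to report
        return []
    if not node.children:                             # boundary leaf: test its point
        return [node.point] if node.point is not None and in_box(node.point, Q) else []
    out: List[Point] = []                             # boundary interior node: recurse
    for child in node.children:
        out.extend(range_search(child, Q))
    return out

def contains(Q: Region, R: Region) -> bool:           # is R entirely inside Q?
    return Q[0] <= R[0] and Q[1] <= R[1] and R[2] <= Q[2] and R[3] <= Q[3]

def intersects(Q: Region, R: Region) -> bool:         # do Q and R overlap at all?
    return not (R[2] < Q[0] or Q[2] < R[0] or R[3] < Q[1] or Q[3] < R[1])

def in_box(p: Point, Q: Region) -> bool:
    return Q[0] <= p[0] <= Q[2] and Q[1] <= p[1] <= Q[3]

def all_points(node: QTNode) -> List[Point]:
    if not node.children:
        return [node.point] if node.point is not None else []
    return [p for c in node.children for p in all_points(c)]

# Example: k = 3, bounding box [0, 8] x [0, 8]; query rectangle [2, 8] x [0, 6].
root = build([(1, 2), (3, 7), (6, 1), (7, 6)], (0, 0, 8, 8))
print(range_search(root, (2, 0, 8, 6)))               # [(6, 1), (7, 6)]
```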

Quadtree Analysis

  • Height of a Quadtree:

    • Height can be very large for uneven distributions of points.
    • Even with as few as 3 points, height could be arbitrarily large.


  • Bound on Height:

    • The height cannot be bounded in terms of \(n\) alone, but it can be bounded using the spread factor of the points \(P\):
    \[ \text{spread factor} = \frac{\text{sidelength of } R}{\text{minimum distance between points in } P} \]
    • Height \(h\) of the quadtree is in \(\Theta(\log(\text{spread factor}))\) (a sketch of the argument follows after this list).
  • Complexity:
    • Initial tree building: \(\Theta(nh)\) in the worst case.
    • Range search: \(\Theta(nh)\) in the worst case, even if no points match.
  • Quadtrees can be generalized to higher dimensions by splitting into octants (resulting in octrees), but they are rarely used beyond 3 dimensions.
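A brief sketch of why the height bound holds (upper-bound direction), with \(d_{\min}\) denoting the minimum distance between points of \(P\): a node at depth \(i\) corresponds to a square of side length \(2^{k-i}\), so two points inside it are at distance at most \(\sqrt{2}\cdot 2^{k-i}\). A node is split only if its square contains at least two points, which requires \(\sqrt{2}\cdot 2^{k-i} \geq d_{\min}\), i.e.,

\[ i \;\leq\; \log\!\left(\frac{\sqrt{2}\cdot 2^{k}}{d_{\min}}\right) \;\in\; O(\log(\text{spread factor})), \]

and the height is at most one more than the maximum depth at which a split occurs.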

Quadtree Summary

  • Easy to compute and manage.
  • No complex arithmetic; only divisions by 2 (bit-shift) if bounding box \(R\) width/height is a power of 2.
  • Space can be wasteful but effective if points are well-distributed.
  • Variations:
    • Stop splitting earlier and allow up to \(K\) points per leaf (for a fixed bound \(K\)).
    • Use quadtree for storing pixelated images.


kd-Trees

  • Given \(n\) points \(P = \{(x_0, y_0), (x_1, y_1), \dots, (x_{n-1}, y_{n-1})\}\).
  • Quadtree: Splits into quadrants regardless of point positions.
  • KD-Tree:
    • Splits at the upper median of coordinates (roughly half of the points in each subtree).
    • Each node has a splitting line in one dimension (in 2D, either vertical or horizontal).
    • Convention: Points on split lines belong to the right/top side.
    • Alternates between vertical and horizontal splits until each point is in its own region.
  • Alternative splitting strategy: split along the dimension that gives the resulting regions the better (closer to square) aspect ratio.


Constructing KD-Trees

  1. Start with an initial split by the x-coordinate on points \(P\):
    • If \(|P| \leq 1\), create a leaf node and return.
    • Otherwise:
      • Set \(X := \text{randomized-quick-select}(P, \lfloor n/2 \rfloor)\): the upper-median \(x\)-coordinate among the points.
      • Partition \(P\) by x-coordinate into:
        • \(P_{x < X}\): Points with x-coordinates less than \(X\).
        • \(P_{x \geq X}\): Points with x-coordinates greater than or equal to \(X\).
        • Distribute \(\lfloor n/2 \rfloor\) points on one side and \(\lceil n/2 \rceil\) points on the other.
        • Note: Points are assumed to be in general position.
  2. Recursively create subtrees:
    • Left subtree for \(P_{x < X}\) (split by \(y\)-coordinate).
    • Right subtree for \(P_{x \geq X}\) (split by \(y\)-coordinate).
  3. Symmetry: Start building with an initial \(y\)-split if desired. (A construction sketch in code follows below.)
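Here is a minimal Python sketch of the construction just described. Interior nodes store the splitting coordinate, leaves store single points, and splits alternate between \(x\) and \(y\). For brevity the median is found by sorting rather than randomized-quick-select, which gives \(O(n \log^2 n)\) construction instead of the expected run-time analyzed below; the names KDNode and build_kdtree are illustrative, not from the notes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]

@dataclass
class KDNode:
    axis: int = 0                         # 0: split by x (vertical line), 1: split by y
    split: Optional[float] = None         # splitting coordinate (interior nodes only)
    point: Optional[Point] = None         # stored point (leaves only)
    left: Optional["KDNode"] = None       # points with coordinate < split
    right: Optional["KDNode"] = None      # points with coordinate >= split (right/top side)

def build_kdtree(P: List[Point], axis: int = 0) -> Optional[KDNode]:
    """Build a kd-tree; assumes points in general position (distinct coordinates)."""
    if not P:
        return None
    if len(P) == 1:
        return KDNode(point=P[0])
    P = sorted(P, key=lambda p: p[axis])          # stand-in for randomized-quick-select
    X = P[len(P) // 2][axis]                      # upper-median coordinate
    left = [p for p in P if p[axis] < X]          # floor(n/2) points
    right = [p for p in P if p[axis] >= X]        # ceil(n/2) points
    return KDNode(axis=axis, split=X,
                  left=build_kdtree(left, 1 - axis),
                  right=build_kdtree(right, 1 - axis))

# Example: the first split is vertical, at the upper-median x-coordinate 4.
root = build_kdtree([(1, 5), (3, 2), (6, 7), (8, 1), (4, 9)])
print(root.axis, root.split)                      # 0 4
```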

KD-Tree Analysis

Run-Time

  • Partitioning: \(P\) is divided in \(\Theta(n)\) expected time using randomized-quick-select.
  • Recurrence:
\[ T^{exp}(n) = 2T^{exp}(n/2) + O(n) \]

Solves to \(\Theta(n \log n)\) expected time.

Can be reduced to \(\Theta(n \log n)\) worst-case time with pre-sorting.

Height

  • Relation: \(h(1) = 0\) and \(h(n) \leq h(\lceil n/2 \rceil) + 1\)
  • Result: Resolves to \(O(\log n)\) height, specifically \(\lceil \log n \rceil\).

Space

  • Interior Nodes: All interior nodes have two children.
  • Total Nodes: \(n - 1\) interior nodes for \(n\) leaves.
  • Space Complexity: \(\Theta(n)\).

KD-Tree Dictionary Operations

  • search: Similar to binary search tree, using specified coordinate.
  • insert: Locate position, add as new leaf.
  • delete: Locate and remove leaf.

Problem

  • Issue: After insertions or deletions, splits may not align with the median, and height may exceed \(\lceil \log_2 n \rceil\).
  • Solution: Maintain \(O(\log n)\) height by occasionally rebuilding subtrees (without details here), though this slows down range searches.

Note: KD-trees are not efficient for frequent insertions or deletions.

  • Range search in a KD-tree is similar to that in a quadtree.
  • Differences:
    • Only two children per node.
    • Leaves always store points.
**kdTree::range-search(r ← root, Q)**
r: The root of a kd-tree, Q: Query-rectangle

R ← region associated with node r
if (R ⊆ Q) then report all points below r; return
if (R ∩ Q is empty) then return
if (r is a leaf) then
  p ← point stored at r
  if p is in Q then report p
  return
for each child v of r do kdTree::range-search(v, Q)
  • Node Storage: Each node keeps its associated region.
  • Space Optimization: Pass the region as a parameter to compute each child's region based on the splitting line.

Range Search Complexity

  • We spend \(O(1)\) time at each visited node, except for the step "if (R ⊆ Q) then report all points below r; return".
  • All executions of that reporting step together take \(O(s)\) time (recall: \(s\) is the output size).
  • Observe: The number of visited nodes is \(O(\beta(n))\), where \(\beta(n)\) is the number of "boundary" nodes, i.e., nodes for which:
    • kdTree::range-search was called, and
    • neither \(R \subseteq Q\) nor \(R \cap Q = \emptyset\) holds.
  • \(\beta(n)\) satisfies the following recurrence relation:

    \[ \beta(n) \leq 2\beta(n/4) + O(1) \]
  • This implies \(\beta(n) \in O(\sqrt{n})\) (a short derivation follows this list).

  • Therefore, the complexity of range search in kd-trees is \(O(s + \sqrt{n})\).
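A brief sketch of why the recurrence yields this bound (writing \(c\) for the \(O(1)\) term): unrolling gives

\[ \beta(n) \;\leq\; 2\beta(n/4) + c \;\leq\; 4\beta(n/16) + 3c \;\leq\; \cdots \;\leq\; 2^{i}\,\beta(n/4^{i}) + (2^{i}-1)\,c, \]

and stopping at \(i = \log_4 n\) yields \(\beta(n) \in O\big(2^{\log_4 n}\big) = O\big(n^{\log_4 2}\big) = O(\sqrt{n})\).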

Higher Dimensions

  • kd-trees for \(d\)-dimensional space:
    • At the root, the point set is partitioned based on the first coordinate.
    • At the subtrees of the root, the partition is based on the second coordinate.
    • At depth \(d - 1\), the partition is based on the last coordinate.
    • At depth \(d\), we start all over again, partitioning on the first coordinate.
  • Storage: \(O(n)\)
  • Height: \(O(\log n)\)
  • Construction time: \(O(n \log n)\)
  • Range search time: \(O(s + n^{1 - 1/d})\)
    • This assumes that \(d\) is a constant.

Range Trees

Towards Range Trees

  • Current limitations:
    • Quadtrees and kd-trees are intuitive and simple.
    • Both may be slow for range searches.
    • Quadtrees can be space-inefficient.
  • New Idea: Range trees
    • Tree of trees (a multi-level data structure)
      • Traditional nodes store key-value pairs and references to children and possibly the parent.
      • Range trees allow more information to be stored in each node.
      • Each node contains another binary search tree.
  • Benefits:
    • Though space-wasteful, range trees enable much faster range searches.

2-dimensional Range Trees

  • Primary structure:
    • A balanced binary search tree \(T\) that stores \(P\) and uses \(x\)-coordinates as keys.
  • Associate structure:
    • Each node \(z\) in \(T\) has an associated structure \(T_{ass}(z)\).
    • \(P(z)\): All points in the subtree of \(z\) in \(T\), including the point at \(z\).
    • \(T_{ass}(z)\): Stores \(P(z)\) in a balanced binary search tree using \(y\)-coordinates as keys.
    • Note: The point of \(z\) is not necessarily the root of \(T_{ass}(z)\).


Range Tree Space Analysis

  • Primary tree \(T\) uses \(O(n)\) space.
  • Question: How many nodes do all associate trees together have?
    • For example, let \(a\) be the root, \(b\) a child of \(a\), and \(c\) a child of \(b\):
    • Point of \(a\) is only in associate tree \(T_{ass}(a)\).
    • Point of \(b\) is in associate trees \(T_{ass}(a), T_{ass}(b)\).
    • Point of \(c\) is in associate trees \(T_{ass}(a), T_{ass}(b), T_{ass}(c)\).
  • Key Insight:
    • Point of \(z\) is in associate tree \(T_{ass}(u)\) if and only if \(u\) is an ancestor of \(z\) in \(T\) (counting \(z\) as an ancestor of itself).
    • Therefore, each point belongs to \(O(\log n)\) associate trees.
    • As a result, all associate trees together use \(O(n \log n)\) space.
  • Conclusion: A range-tree with \(n\) points uses \(O(n \log n)\) space, which is tight for some primary trees.


Range Trees Operations

  • search: Search by \(x\)-coordinate in \(T\).
  • insert:
    • Insert point by \(x\)-coordinate into \(T\).
    • Walk back up to the root, inserting by \(y\)-coordinate in all associate trees \(T_{ass}(z)\) of nodes \(z\) on the path to the root.
  • delete: Analogous to insertion.
  • Problem: Ensuring balanced binary search trees.
    • Using AVL-trees makes insert/delete slow due to rotations, which change \(P(v)\) and require re-building \(T_{ass}(v)\).
    • Solution: Rebuild highly unbalanced subtrees.
    • Runtime for insert/delete: \(O(\log^2 n)\) amortized.
  • range-search:
    • Search by \(x\)-range in \(T\).
    • For found points, search by \(y\)-range in associated trees.
  • Note: For this we must first understand 1-dimensional range search in binary search trees.

BST Range Search (recursive)

**BST::range-search-recursive(r ← root, x₁, x₂)**
r: root of a binary search tree, x₁, x₂: search keys
Returns keys in subtree at r that are in range [x₁, x₂]

if r = NULL then return ∅
if x₁ ≤ r.key ≤ x₂ then
    L ← BST::range-search-recursive(r.left, x₁, x₂)
    R ← BST::range-search-recursive(r.right, x₁, x₂)
    return L ∪ {r.key} ∪ R
if r.key < x₁ then
    return BST::range-search-recursive(r.right, x₁, x₂)
if r.key > x₂ then
    return BST::range-search-recursive(r.left, x₁, x₂)

Note: Keys are reported in in-order, i.e., in sorted order.
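An executable version of the pseudocode above, as a minimal Python sketch (the Node class and names are illustrative, not from the notes):

```python
from typing import List, Optional

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def range_search(r: Optional[Node], x1, x2) -> List:
    """Return all keys k in the subtree at r with x1 <= k <= x2, in sorted order."""
    if r is None:
        return []
    if r.key < x1:                        # whole left subtree is out of range
        return range_search(r.right, x1, x2)
    if r.key > x2:                        # whole right subtree is out of range
        return range_search(r.left, x1, x2)
    # x1 <= r.key <= x2: recurse on both sides; in-order concatenation gives sorted output
    return (range_search(r.left, x1, x2) + [r.key]
            + range_search(r.right, x1, x2))

# Example BST with keys 2, 5, 7, 10, 14:
root = Node(7, Node(5, Node(2)), Node(10, right=Node(14)))
print(range_search(root, 4, 12))          # [5, 7, 10]
```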

BST Range Search Re-phrased


  • Search for left boundary \(x_1\): this defines path \(P_1\).
  • Search for right boundary \(x_2\): this defines path \(P_2\).
  • The tree \(T\) is divided into three sections:
    • Outside the range.
    • On the paths \(P_1\) and \(P_2\).
    • Between the paths.

This classification of nodes will be essential for further analysis.

  • Boundary nodes: Nodes in paths \(P_1\) or \(P_2\)
    • For each boundary node, check if it is within the range.
  • Outside nodes: Nodes to the left of \(P_1\) or right of \(P_2\)
    • These nodes are outside the range and are not visited.
  • Inside nodes: Nodes to the right of \(P_1\) and left of \(P_2\)
    • Keep a list of the topmost inside nodes.
    • All descendants of these nodes are within the range, so for a 1-dimensional range search, report them directly.

BST Range Search Analysis

Assuming a balanced binary search tree:

  • Path Searches:
    • Search for path \(P_1\): \(O(\log n)\)
    • Search for path \(P_2\): \(O(\log n)\)
  • Boundary Nodes:
    • There are \(O(\log n)\) boundary nodes, each taking \(O(1)\) time.
  • Topmost Inside Nodes:
    • Each topmost inside node \(v\) requires \(O(1)\) time.
    • Topmost inside nodes are children of boundary nodes, so there are \(O(\log n)\) of them, taking \(O(\log n)\) time in total.
  • Descendants:
    • For 1D range search, also report descendants of topmost inside nodes.
    • Sum of descendants across topmost inside nodes is \(\leq s\), as their subtrees are disjoint. This takes \(O(s)\) time.

Run-time for 1D Range Search: \(O(\log n + s)\). While not faster overall, the concept of topmost inside nodes is critical for 2D range search.

Range Trees: Range Search

To perform a range search for \(Q = [x_1, x_2] \times [y_1, y_2]\) in a 2-dimensional range tree, follow these steps:

  1. Range search on \(x\)-coordinates: Perform a range search on the \(x\)-coordinates for the interval \([x_1, x_2]\) in the primary tree \(T\): BST::range-search(T, x₁, x₂)
  2. Boundary and Topmost Inside Nodes:
    • Identify boundary and topmost inside nodes as usual.
  3. Processing Boundary Nodes:
    • For each boundary node, check if the corresponding point lies within region \(Q\).
  4. Processing Topmost Inside Nodes:
    • For each topmost inside node \(v\):
      • Recall that \(P(v)\) denotes the points in the subtree of \(v\) in \(T\).
      • Since all \(x\)-coordinates in \(P(v)\) are within range, only the \(y\)-coordinates remain to be checked.
      • Range search on \(y\)-coordinates:
        • Use \(T_{ass}(v)\), the associate tree of \(v\), to find the points whose \(y\)-coordinates fall within \([y_1, y_2]\): BST::range-search(T_ass(v), y₁, y₂)
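Putting the steps together, here is a minimal, static Python sketch of this 2-dimensional range search. For simplicity each associate structure is a list of \(P(z)\) sorted by \(y\)-coordinate (standing in for a balanced BST on \(y\)), the primary tree is built perfectly balanced from the points sorted by \(x\), and two boolean flags mark subtrees already known to lie entirely inside the \(x\)-range (the topmost inside nodes). Names (RTNode, build, range_search) are illustrative, not from the notes; general position is assumed.

```python
import bisect
from typing import List, Optional, Tuple

Point = Tuple[float, float]

class RTNode:
    def __init__(self, point: Point):
        self.point = point                   # key of this primary-tree node (by x)
        self.left: Optional["RTNode"] = None
        self.right: Optional["RTNode"] = None
        self.by_y: List[Point] = []          # P(z), sorted by y-coordinate
        self.ys: List[float] = []            # y-coordinates of by_y, for bisect

def build(points: List[Point]) -> Optional[RTNode]:
    pts = sorted(points)                     # sort by x once
    def rec(lo: int, hi: int) -> Optional[RTNode]:
        if lo > hi:
            return None
        mid = (lo + hi) // 2
        node = RTNode(pts[mid])
        node.left, node.right = rec(lo, mid - 1), rec(mid + 1, hi)
        node.by_y = sorted(pts[lo:hi + 1], key=lambda p: p[1])   # associate structure
        node.ys = [p[1] for p in node.by_y]
        return node
    return rec(0, len(pts) - 1)

def range_search(root: Optional[RTNode], x1, x2, y1, y2) -> List[Point]:
    out: List[Point] = []

    def report_assoc(node: RTNode) -> None:  # 1D y-range search in the associate structure
        lo = bisect.bisect_left(node.ys, y1)
        hi = bisect.bisect_right(node.ys, y2)
        out.extend(node.by_y[lo:hi])

    # ge / le record that every x-coordinate in this subtree is already known to be
    # >= x1 / <= x2; once both hold, the node is a (topmost) inside node.
    def rec(node: Optional[RTNode], ge: bool, le: bool) -> None:
        if node is None:
            return
        if ge and le:
            report_assoc(node)               # whole subtree inside the x-range
            return
        x, y = node.point                    # boundary node: test its own point
        if x1 <= x <= x2 and y1 <= y <= y2:
            out.append(node.point)
        if x > x1:                           # left subtree may contain keys >= x1
            rec(node.left, ge, le or x <= x2)
        if x < x2:                           # right subtree may contain keys <= x2
            rec(node.right, ge or x >= x1, le)

    rec(root, False, False)
    return out

# Example: points in the query rectangle [2, 7] x [1, 6].
pts = [(1, 5), (3, 2), (4, 8), (6, 3), (8, 6)]
root = build(pts)
print(sorted(range_search(root, 2, 7, 1, 6)))   # [(3, 2), (6, 3)]
```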

Range Trees: Range Search Run-time

  • \(O(\log n)\) time to find boundary and topmost inside nodes in the primary tree.
  • There are \(O(\log n)\) such nodes.
  • \(O(\log n + s_v)\) time for each topmost inside node \(v\), where \(s_v\) is the number of points in \(T_{\text{ass}}(v)\) that are reported.
  • Two topmost inside nodes have no common point in their trees:
    • \(\Rightarrow\) every point is reported in at most one associate structure
    • \(\Rightarrow \sum_{v \text{ topmost inside}} s_v \leq s\)

Time for range search in range-tree is proportional to

\[ \sum_{v \text{ topmost inside}} (\log n + s_v) \in O(\log^2 n + s) \]

Range Trees: Higher Dimensions

  • Range trees can be generalized to \(d\)-dimensional space.
    • Space: \(O(n (\log n)^{d-1})\)
    • Construction time: \(O(n (\log n)^d)\)
    • Range search time: \(O(s + (\log n)^d)\)

(Note: \(d\) is considered to be a constant.)

  • Space/Time trade-off compared to kd-trees.


Conclusion

Range search data structures summary

  • Quadtrees
    • Simple (also for dynamic set of points)
    • Work well only if points are evenly distributed
    • Wastes space for higher dimensions
  • kd-trees
    • Linear space
    • Range search time \(O(\sqrt{n} + s)\)
    • Inserts/deletes destroy the balance and hence the range search run-time (no simple fix)
  • Range-trees
    • Range search time \(O(\log^2 n + s)\)
    • Wastes some space
    • Inserts/deletes destroy balance (can fix this with occasional rebuild)
