
Module 8: Range-Searching in Dictionaries for Points

Range Searches

  • So far: search($k$) looks for one specific item.
  • New operation range-search: look for all items that fall within a given range.

    • Input: A range, i.e., an interval \(Q = (x, x')\). It may be open or closed at the ends.
    • Goal: Report all KVPs in the dictionary whose key \(k\) satisfies \(k \in Q\).


  • As usual, \(n\) denotes the number of input items.

  • Let \(s\) be the output size, i.e., the number of items in the range.
  • We need \(\Omega(s)\) time simply to report the items.
  • Note that sometimes \(s = 0\) and sometimes \(s = n\); we therefore keep it as a separate parameter when analyzing the runtime.

Typical runtime: \(O(\log n + s)\).

Range Searches in Existing Dictionary Realizations

  • Unsorted list/array/hash table: Range search requires \(\Omega(n)\) time:
    • We have to check each item explicitly to see if it is in the range.
  • Sorted array: Range search in \(A\) can be done in \(O(\log n + s)\) time:


    • Using binary search, find \(i\) such that \(x\) is at (or would be at) \(A[i]\).
    • Using binary search, find \(i'\) such that \(x'\) is at (or would be at) \(A[i']\).
    • Report all items \(A[i+1 \dots i' - 1]\).
    • Report \(A[i]\) and \(A[i']\) individually if they are in range. (A code sketch follows after this list.)
  • BST: Range searches can similarly be done in \(O(\text{height} + s)\) time.
    • We will see this in detail later.
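For the sorted-array case, the two binary searches and the report step can be sketched as follows. This is a minimal Python sketch; the function name and the use of the bisect module are illustrative, not from the notes.

```python
# Range search in a sorted array A: O(log n) for the two binary searches
# plus O(s) to report the s items that lie in the closed range [x1, x2].
from bisect import bisect_left, bisect_right

def range_search_sorted(A, x1, x2):
    i = bisect_left(A, x1)    # first index with A[i] >= x1
    j = bisect_right(A, x2)   # first index with A[j] > x2
    return A[i:j]             # exactly the items with x1 <= key <= x2

# Example: keys 5 and 8 fall inside the closed range [4, 9].
print(range_search_sorted([1, 3, 5, 8, 12], 4, 9))  # [5, 8]
```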

Multi-Dimensional Data

Range searches are of special interest for multi-dimensional data. Example: flights that leave between 9am and noon, and cost between $400 and $600.


  • Each item has \(d\) aspects (coordinates): \((x_0, x_1, \dots, x_{d-1})\), so it corresponds to a point in \(d\)-dimensional space.
  • We concentrate on \(d = 2\), i.e., points in the Euclidean plane.
  • (Orthogonal) \(d\)-dimensional range search: Given a query rectangle \(Q = [x_1, x_1'] \times \cdots \times [x_d, x_d']\), find all points that lie within \(Q\).

The time for range searches depends on how the points are stored.

  • Could store a 1-dimensional dictionary (where the key is some combination of the aspects).
    • Problem: Range search on one aspect is not straightforward.
  • Could use one dictionary for each aspect.
    • Problem: Inefficient, wastes space.
  • Better idea: Design new data structures specifically for points.
    • Quadtrees
    • kd-trees
    • range-trees
  • Assumption: Points are in general position:
    • No two points lie on a common horizontal line.
    • No two points lie on a common vertical line.

This simplifies the presentation; the data structures can be generalized to remove the assumption.

Quadtrees

  • We have \(n\) points in the plane, represented as \(P = \{(x_0, y_0), (x_1, y_1), \dots, (x_{n-1}, y_{n-1})\}\).
  • Define a bounding box \(R = [0, 2^k] \times [0, 2^k]\) to contain all points.
    • Assume all coordinates are non-negative.
    • Choose the smallest \(k\) such that the box covers all points.

Quadtree Structure and Construction

  • Root node represents region \(R\).
  • If \(R\) has 0 or 1 point, the root is a leaf storing that point.
  • Otherwise, split \(R\) into four quadrants: \(R_{NE}, R_{NW}, R_{SW}, R_{SE}\).
    • Partition points into sets based on quadrants.
    • Convention: Points on split lines go to the right/top side.
  • Recursively build subtrees for each quadrant, making them children of the root.

Quadtree Dictionary Operations

  • search: Similar to binary search trees and tries.
  • insert:
    • Search for the point.
    • If the leaf now contains two points, split it (repeatedly if necessary) until the two points lie in different regions.
  • delete:
    • Search for the point and remove it.
    • If the parent now has only one point left below it, replace the parent and its children by a single leaf (and repeat upward while an ancestor has only one point left below it).
**QTree::range-search(r ← root, Q)**
r: The root of a quadtree, Q: Query-rectangle

R ← region associated with node r
if (R ⊆ Q) then                              // inside node: stop searching
    report all points below r; return
else if (R ∩ Q is empty) then return         // outside node: stop searching
// otherwise r is a boundary node: recurse
if (r is a leaf) then
    p ← point stored at r
    if p is not NULL and p is in Q then report p
    return
for each child v of r do QTree::range-search(v, Q)

Note: We assume each node of the quadtree stores the associated square. Alternatively, these could be re-computed during the search (space-time tradeoff).
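The construction and the range search above can be sketched in Python as follows. This is a minimal sketch under the stated conventions (points on split lines go to the right/top side, general position, bounding box \([0, 2^k] \times [0, 2^k]\)); the names QTNode, build, and range_search are illustrative, not from the notes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Region = Tuple[float, float, float, float]           # (x_lo, y_lo, x_hi, y_hi)

@dataclass
class QTNode:
    region: Region
    point: Optional[Point] = None                     # set only at leaves
    children: List["QTNode"] = field(default_factory=list)

def build(points: List[Point], region: Region) -> QTNode:
    """Recursively build a quadtree for distinct points inside region."""
    node = QTNode(region)
    if len(points) <= 1:                              # 0 or 1 point: leaf
        node.point = points[0] if points else None
        return node
    x_lo, y_lo, x_hi, y_hi = region
    xm, ym = (x_lo + x_hi) / 2, (y_lo + y_hi) / 2     # split into four quadrants
    quads = [(x_lo, y_lo, xm, ym), (xm, y_lo, x_hi, ym),    # SW, SE
             (x_lo, ym, xm, y_hi), (xm, ym, x_hi, y_hi)]    # NW, NE
    buckets: List[List[Point]] = [[], [], [], []]
    for x, y in points:                               # right/top convention: >= goes right/up
        buckets[(2 if y >= ym else 0) + (1 if x >= xm else 0)].append((x, y))
    node.children = [build(b, q) for b, q in zip(buckets, quads)]
    return node

def range_search(node: QTNode, Q: Region) -> List[Point]:
    R = node.region
    if contains(Q, R):                                # inside node: report everything below
        return all_points(node)
    if not intersects(Q, R):                          # outside node: nothing to report
        return []
    if not node.children:                             # boundary leaf: test its point
        return [node.point] if node.point is not None and in_box(node.point, Q) else []
    out: List[Point] = []                             # boundary interior node: recurse
    for child in node.children:
        out.extend(range_search(child, Q))
    return out

def contains(Q: Region, R: Region) -> bool:           # is R entirely inside Q?
    return Q[0] <= R[0] and Q[1] <= R[1] and R[2] <= Q[2] and R[3] <= Q[3]

def intersects(Q: Region, R: Region) -> bool:         # do Q and R overlap at all?
    return not (R[2] < Q[0] or Q[2] < R[0] or R[3] < Q[1] or Q[3] < R[1])

def in_box(p: Point, Q: Region) -> bool:
    return Q[0] <= p[0] <= Q[2] and Q[1] <= p[1] <= Q[3]

def all_points(node: QTNode) -> List[Point]:
    if not node.children:
        return [node.point] if node.point is not None else []
    return [p for c in node.children for p in all_points(c)]

# Example: k = 3, bounding box [0, 8] x [0, 8]; query rectangle [2, 8] x [0, 6].
root = build([(1, 2), (3, 7), (6, 1), (7, 6)], (0, 0, 8, 8))
print(range_search(root, (2, 0, 8, 6)))               # [(6, 1), (7, 6)]
```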

Quadtree Analysis

  • Height of a Quadtree:

    • Height can be very large for uneven distributions of points.
    • Even with as few as 3 points, height could be arbitrarily large.


  • Bound on Height:

    • The height cannot be bounded in terms of \(n\) alone, but it can be bounded using the spread factor of the points \(P\):
    \[ \text{spread factor} = \frac{\text{sidelength of } R}{\text{minimum distance between points in } P} \]
    • Height \(h\) of the quadtree is in \(\Theta(\log(\text{spread factor}))\) (a sketch of the argument follows after this list).
  • Complexity:
    • Initial tree building: \(\Theta(nh)\) in the worst case.
    • Range search: \(\Theta(nh)\) in the worst case, even if no points match.
  • Quadtrees can be generalized to higher dimensions by splitting into octants (resulting in octrees), but they are rarely used beyond 3 dimensions.
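A brief sketch of why the height bound holds (upper-bound direction), with \(d_{\min}\) denoting the minimum distance between points of \(P\): a node at depth \(i\) corresponds to a square of side length \(2^{k-i}\), so two points inside it are at distance at most \(\sqrt{2}\cdot 2^{k-i}\). A node is split only if its square contains at least two points, which requires \(\sqrt{2}\cdot 2^{k-i} \geq d_{\min}\), i.e.,

\[ i \;\leq\; \log\!\left(\frac{\sqrt{2}\cdot 2^{k}}{d_{\min}}\right) \;\in\; O(\log(\text{spread factor})), \]

and the height is at most one more than the maximum depth at which a split occurs.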

Quadtree Summary

  • Easy to compute and manage.
  • No complex arithmetic; only divisions by 2 (bit-shift) if bounding box \(R\) width/height is a power of 2.
  • Space can be wasteful but effective if points are well-distributed.
  • Variations:
    • Stop splitting earlier and allow up to \(K\) points per leaf (for a fixed bound \(K\)).
    • Use quadtree for storing pixelated images.


kd-Trees

  • Given \(n\) points \(P = \{(x_0, y_0), (x_1, y_1), \dots, (x_{n-1}, y_{n-1})\}\).
  • Quadtree: Splits into quadrants regardless of point positions.
  • KD-Tree:
    • Splits at the upper median of coordinates (roughly half of the points in each subtree).
    • Each node has a splitting line in one dimension (in 2D, either vertical or horizontal).
    • Convention: Points on split lines belong to the right/top side.
    • Alternates between vertical and horizontal splits until each point is in its own region.
  • Alternative splitting strategy: split along the dimension that gives the resulting regions the better (closer to square) aspect ratio.


Constructing KD-Trees

  1. Start with an initial split by the x-coordinate on points \(P\):
    • If \(|P| \leq 1\), create a leaf node and return.
    • Otherwise:
      • Set \(X := \text{randomized-quick-select}(P, \lfloor n/2 \rfloor)\): the upper-median \(x\)-coordinate among the points.
      • Partition \(P\) by x-coordinate into:
        • \(P_{x < X}\): Points with x-coordinates less than \(X\).
        • \(P_{x \geq X}\): Points with x-coordinates greater than or equal to \(X\).
        • Distribute \(\lfloor n/2 \rfloor\) points on one side and \(\lceil n/2 \rceil\) points on the other.
        • Note: Points are assumed to be in general position.
  2. Recursively create subtrees:
    • Left subtree for \(P_{x < X}\) (split by \(y\)-coordinate).
    • Right subtree for \(P_{x \geq X}\) (split by \(y\)-coordinate).
  3. Symmetry: Start building with an initial \(y\)-split if desired. (A construction sketch in code follows below.)
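Here is a minimal Python sketch of the construction just described. Interior nodes store the splitting coordinate, leaves store single points, and splits alternate between \(x\) and \(y\). For brevity the median is found by sorting rather than randomized-quick-select, which gives \(O(n \log^2 n)\) construction instead of the expected run-time analyzed below; the names KDNode and build_kdtree are illustrative, not from the notes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]

@dataclass
class KDNode:
    axis: int = 0                         # 0: split by x (vertical line), 1: split by y
    split: Optional[float] = None         # splitting coordinate (interior nodes only)
    point: Optional[Point] = None         # stored point (leaves only)
    left: Optional["KDNode"] = None       # points with coordinate < split
    right: Optional["KDNode"] = None      # points with coordinate >= split (right/top side)

def build_kdtree(P: List[Point], axis: int = 0) -> Optional[KDNode]:
    """Build a kd-tree; assumes points in general position (distinct coordinates)."""
    if not P:
        return None
    if len(P) == 1:
        return KDNode(point=P[0])
    P = sorted(P, key=lambda p: p[axis])          # stand-in for randomized-quick-select
    X = P[len(P) // 2][axis]                      # upper-median coordinate
    left = [p for p in P if p[axis] < X]          # floor(n/2) points
    right = [p for p in P if p[axis] >= X]        # ceil(n/2) points
    return KDNode(axis=axis, split=X,
                  left=build_kdtree(left, 1 - axis),
                  right=build_kdtree(right, 1 - axis))

# Example: the first split is vertical, at the upper-median x-coordinate 4.
root = build_kdtree([(1, 5), (3, 2), (6, 7), (8, 1), (4, 9)])
print(root.axis, root.split)                      # 0 4
```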

KD-Tree Analysis

Run-Time

  • Partitioning: \(P\) is divided in \(\Theta(n)\) expected time using randomized-quick-select.
  • Recurrence:
\[ T^{exp}(n) = 2T^{exp}(n/2) + O(n) \]

Solves to \(\Theta(n \log n)\) expected time.

Can be reduced to \(\Theta(n \log n)\) worst-case time with pre-sorting.

Height

  • Relation: \(h(1) = 0\) and \(h(n) \leq h(\lceil n/2 \rceil) + 1\)
  • Result: Resolves to \(O(\log n)\) height, specifically \(\lceil \log n \rceil\).

Space

  • Interior Nodes: All interior nodes have two children.
  • Total Nodes: \(n - 1\) interior nodes for \(n\) leaves.
  • Space Complexity: \(\Theta(n)\).

KD-Tree Dictionary Operations

  • search: Similar to binary search tree, using specified coordinate.
  • insert: Locate position, add as new leaf.
  • delete: Locate and remove leaf.

Problem

  • Issue: After insertions or deletions, splits may not align with the median, and height may exceed \(\lceil \log_2 n \rceil\).
  • Solution: Maintain \(O(\log n)\) height by occasionally rebuilding subtrees (without details here), though this slows down range searches.

Note: KD-trees are not efficient for frequent insertions or deletions.

  • Range search in a KD-tree is similar to that in a quadtree.
  • Differences:
    • Only two children per node.
    • Leaves always store points.
**kdTree::range-search(r ← root, Q)**
r: The root of a kd-tree, Q: Query-rectangle

R ← region associated with node r
if (R ⊆ Q) then report all points below r; return
if (R ∩ Q is empty) then return
if (r is a leaf) then
  p ← point stored at r
  if p is in Q then report p
  return
for each child v of r do kdTree::range-search(v, Q)
  • Node Storage: Each node keeps its associated region.
  • Space Optimization: Pass the region as a parameter to compute each child's region based on the splitting line.

Range Search Complexity

  • We spend \(O(1)\) time at each visited node, except for the step "if (R ⊆ Q) then report all points below r; return".
  • All executions of that reporting step together take \(O(s)\) time (recall: \(s\) is the output size).
  • Observe: The number of visited nodes is \(O(\beta(n))\), where \(\beta(n)\) is the number of "boundary" nodes, i.e., nodes for which:
    • kdTree::range-search was called, and
    • neither \(R \subseteq Q\) nor \(R \cap Q = \emptyset\) holds.
  • \(\beta(n)\) satisfies the following recurrence relation:

    \[ \beta(n) \leq 2\beta(n/4) + O(1) \]
  • This implies \(\beta(n) \in O(\sqrt{n})\) (a short derivation follows this list).

  • Therefore, the complexity of range search in kd-trees is \(O(s + \sqrt{n})\).
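A brief sketch of why the recurrence yields this bound (writing \(c\) for the \(O(1)\) term): unrolling gives

\[ \beta(n) \;\leq\; 2\beta(n/4) + c \;\leq\; 4\beta(n/16) + 3c \;\leq\; \cdots \;\leq\; 2^{i}\,\beta(n/4^{i}) + (2^{i}-1)\,c, \]

and stopping at \(i = \log_4 n\) yields \(\beta(n) \in O\big(2^{\log_4 n}\big) = O\big(n^{\log_4 2}\big) = O(\sqrt{n})\).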

Higher Dimensions

  • kd-trees for \(d\)-dimensional space:
    • At the root, the point set is partitioned based on the first coordinate.
    • At the subtrees of the root, the partition is based on the second coordinate.
    • At depth \(d - 1\), the partition is based on the last coordinate.
    • At depth \(d\), we start all over again, partitioning on the first coordinate.
  • Storage: \(O(n)\)
  • Height: \(O(\log n)\)
  • Construction time: \(O(n \log n)\)
  • Range search time: \(O(s + n^{1 - 1/d})\)
    • This assumes that \(d\) is a constant.

Range Trees

Towards Range Trees

  • Current limitations:
    • Quadtrees and kd-trees are intuitive and simple.
    • Both may be slow for range searches.
    • Quadtrees can be space-inefficient.
  • New Idea: Range trees
    • Tree of trees (a multi-level data structure)
      • Traditional nodes store key-value pairs and references to children and possibly the parent.
      • Range trees allow more information to be stored in each node.
      • Each node contains another binary search tree.
  • Benefits:
    • Though space-wasteful, range trees enable much faster range searches.

2-dimensional Range Trees

  • Primary structure:
    • A balanced binary search tree \(T\) that stores \(P\) and uses \(x\)-coordinates as keys.
  • Associate structure:
    • Each node \(z\) in \(T\) has an associated structure \(T_{ass}(z)\).
    • \(P(z)\): All points in the subtree of \(z\) in \(T\), including the point at \(z\).
    • \(T_{ass}(z)\): Stores \(P(z)\) in a balanced binary search tree using \(y\)-coordinates as keys.
    • Note: The point of \(z\) is not necessarily the root of \(T_{ass}(z)\).


Range Tree Space Analysis

  • Primary tree \(T\) uses \(O(n)\) space.
  • Question: How many nodes do all associate trees together have?
    • For example, let \(a\) be the root, \(b\) a child of \(a\), and \(c\) a child of \(b\):
    • Point of \(a\) is only in associate tree \(T_{ass}(a)\).
    • Point of \(b\) is in associate trees \(T_{ass}(a), T_{ass}(b)\).
    • Point of \(c\) is in associate trees \(T_{ass}(a), T_{ass}(b), T_{ass}(c)\).
  • Key Insight:
    • Point of \(z\) is in associate tree \(T_{ass}(u)\) if and only if \(u\) is an ancestor of \(z\) in \(T\) (counting \(z\) as an ancestor of itself).
    • Therefore, each point belongs to \(O(\log n)\) associate trees.
    • As a result, all associate trees together use \(O(n \log n)\) space.
  • Conclusion: A range-tree with \(n\) points uses \(O(n \log n)\) space, which is tight for some primary trees.


Range Trees Operations

  • search: Search by \(x\)-coordinate in \(T\).
  • insert:
    • Insert point by \(x\)-coordinate into \(T\).
    • Walk back up to the root, inserting by \(y\)-coordinate in all associate trees \(T_{ass}(z)\) of nodes \(z\) on the path to the root.
  • delete: Analogous to insertion.
  • Problem: Ensuring balanced binary search trees.
    • Using AVL-trees makes insert/delete slow due to rotations, which change \(P(v)\) and require re-building \(T_{ass}(v)\).
    • Solution: Rebuild highly unbalanced subtrees.
    • Runtime for insert/delete: \(O(\log^2 n)\) amortized.
  • range-search:
    • Search by \(x\)-range in \(T\).
    • For found points, search by \(y\)-range in associated trees.
  • Note: For this we must first understand 1-dimensional range search in binary search trees.

BST Range Search (recursive)

**BST::range-search-recursive(r ← root, x₁, x₂)**
r: root of a binary search tree, x₁, x₂: search keys
Returns keys in subtree at r that are in range [x₁, x₂]

if r = NULL then return ∅
if x₁ ≤ r.key ≤ x₂ then
    L ← BST::range-search-recursive(r.left, x₁, x₂)
    R ← BST::range-search-recursive(r.right, x₁, x₂)
    return L ∪ {r.key} ∪ R
if r.key < x₁ then
    return BST::range-search-recursive(r.right, x₁, x₂)
if r.key > x₂ then
    return BST::range-search-recursive(r.left, x₁, x₂)

Note: Keys are reported in in-order, i.e., in sorted order.
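An executable version of the pseudocode above, as a minimal Python sketch (the Node class and names are illustrative, not from the notes):

```python
from typing import List, Optional

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def range_search(r: Optional[Node], x1, x2) -> List:
    """Return all keys k in the subtree at r with x1 <= k <= x2, in sorted order."""
    if r is None:
        return []
    if r.key < x1:                        # whole left subtree is out of range
        return range_search(r.right, x1, x2)
    if r.key > x2:                        # whole right subtree is out of range
        return range_search(r.left, x1, x2)
    # x1 <= r.key <= x2: recurse on both sides; in-order concatenation gives sorted output
    return (range_search(r.left, x1, x2) + [r.key]
            + range_search(r.right, x1, x2))

# Example BST with keys 2, 5, 7, 10, 14:
root = Node(7, Node(5, Node(2)), Node(10, right=Node(14)))
print(range_search(root, 4, 12))          # [5, 7, 10]
```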

BST Range Search Re-phrased


  • Search for left boundary \(x_1\): this defines path \(P_1\).
  • Search for right boundary \(x_2\): this defines path \(P_2\).
  • The tree \(T\) is divided into three sections:
    • Outside the range.
    • On the paths \(P_1\) and \(P_2\).
    • Between the paths.

This classification of nodes will be essential for further analysis.

  • Boundary nodes: Nodes in paths \(P_1\) or \(P_2\)
    • For each boundary node, check if it is within the range.
  • Outside nodes: Nodes to the left of \(P_1\) or right of \(P_2\)
    • These nodes are outside the range and are not visited.
  • Inside nodes: Nodes to the right of \(P_1\) and left of \(P_2\)
    • Keep a list of the topmost inside nodes.
    • All descendants of these nodes are within the range, so for a 1-dimensional range search, report them directly.

BST Range Search Analysis

Assuming a balanced binary search tree:

  • Path Searches:
    • Search for path \(P_1\): \(O(\log n)\)
    • Search for path \(P_2\): \(O(\log n)\)
  • Boundary Nodes:
    • There are \(O(\log n)\) boundary nodes, each taking \(O(1)\) time.
  • Topmost Inside Nodes:
    • Each topmost inside node \(v\) requires \(O(1)\) time.
    • Topmost inside nodes are children of boundary nodes, so there are \(O(\log n)\) of them, taking \(O(\log n)\) time in total.
  • Descendants:
    • For 1D range search, also report descendants of topmost inside nodes.
    • Sum of descendants across topmost inside nodes is \(\leq s\), as their subtrees are disjoint. This takes \(O(s)\) time.

Run-time for 1D Range Search: \(O(\log n + s)\). While not faster overall, the concept of topmost inside nodes is critical for 2D range search.

Range Trees: Range Search

To perform a range search for \(Q = [x_1, x_2] \times [y_1, y_2]\) in a 2-dimensional range tree, follow these steps:

  1. Range search on \(x\)-coordinates: Perform a range search on the \(x\)-coordinates for the interval \([x_1, x_2]\) in the primary tree \(T\): BST::range-search(T, x₁, x₂)
  2. Boundary and Topmost Inside Nodes:
    • Identify boundary and topmost inside nodes as usual.
  3. Processing Boundary Nodes:
    • For each boundary node, check if the corresponding point lies within region \(Q\).
  4. Processing Topmost Inside Nodes:
    • For each topmost inside node \(v\):
      • Recall that \(P(v)\) denotes the points in the subtree of \(v\) in \(T\).
      • Since all \(x\)-coordinates in \(P(v)\) are within range, only the \(y\)-coordinates remain to be checked.
      • Range search on \(y\)-coordinates:
        • Use \(T_{ass}(v)\), the associate tree of \(v\), to find the points whose \(y\)-coordinates fall within \([y_1, y_2]\): BST::range-search(T_ass(v), y₁, y₂)
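Putting the steps together, here is a minimal, static Python sketch of this 2-dimensional range search. For simplicity each associate structure is a list of \(P(z)\) sorted by \(y\)-coordinate (standing in for a balanced BST on \(y\)), the primary tree is built perfectly balanced from the points sorted by \(x\), and two boolean flags mark subtrees already known to lie entirely inside the \(x\)-range (the topmost inside nodes). Names (RTNode, build, range_search) are illustrative, not from the notes; general position is assumed.

```python
import bisect
from typing import List, Optional, Tuple

Point = Tuple[float, float]

class RTNode:
    def __init__(self, point: Point):
        self.point = point                   # key of this primary-tree node (by x)
        self.left: Optional["RTNode"] = None
        self.right: Optional["RTNode"] = None
        self.by_y: List[Point] = []          # P(z), sorted by y-coordinate
        self.ys: List[float] = []            # y-coordinates of by_y, for bisect

def build(points: List[Point]) -> Optional[RTNode]:
    pts = sorted(points)                     # sort by x once
    def rec(lo: int, hi: int) -> Optional[RTNode]:
        if lo > hi:
            return None
        mid = (lo + hi) // 2
        node = RTNode(pts[mid])
        node.left, node.right = rec(lo, mid - 1), rec(mid + 1, hi)
        node.by_y = sorted(pts[lo:hi + 1], key=lambda p: p[1])   # associate structure
        node.ys = [p[1] for p in node.by_y]
        return node
    return rec(0, len(pts) - 1)

def range_search(root: Optional[RTNode], x1, x2, y1, y2) -> List[Point]:
    out: List[Point] = []

    def report_assoc(node: RTNode) -> None:  # 1D y-range search in the associate structure
        lo = bisect.bisect_left(node.ys, y1)
        hi = bisect.bisect_right(node.ys, y2)
        out.extend(node.by_y[lo:hi])

    # ge / le record that every x-coordinate in this subtree is already known to be
    # >= x1 / <= x2; once both hold, the node is a (topmost) inside node.
    def rec(node: Optional[RTNode], ge: bool, le: bool) -> None:
        if node is None:
            return
        if ge and le:
            report_assoc(node)               # whole subtree inside the x-range
            return
        x, y = node.point                    # boundary node: test its own point
        if x1 <= x <= x2 and y1 <= y <= y2:
            out.append(node.point)
        if x > x1:                           # left subtree may contain keys >= x1
            rec(node.left, ge, le or x <= x2)
        if x < x2:                           # right subtree may contain keys <= x2
            rec(node.right, ge or x >= x1, le)

    rec(root, False, False)
    return out

# Example: points in the query rectangle [2, 7] x [1, 6].
pts = [(1, 5), (3, 2), (4, 8), (6, 3), (8, 6)]
root = build(pts)
print(sorted(range_search(root, 2, 7, 1, 6)))   # [(3, 2), (6, 3)]
```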

Range Trees: Range Search Run-time

  • \(O(\log n)\) time to find boundary and topmost inside nodes in the primary tree.
  • There are \(O(\log n)\) such nodes.
  • \(O(\log n + s_v)\) time for each topmost inside node \(v\), where \(s_v\) is the number of points in \(T_{\text{ass}}(v)\) that are reported.
  • Two topmost inside nodes have no common point in their trees:
    • \(\Rightarrow\) every point is reported in at most one associate structure
    • \(\Rightarrow \sum_{v \text{ topmost inside}} s_v \leq s\)

Time for range search in range-tree is proportional to

\[ \sum_{v \text{ topmost inside}} (\log n + s_v) \in O(\log^2 n + s) \]

Range Trees: Higher Dimensions

  • Range trees can be generalized to \(d\)-dimensional space.
    • Space: \(O(n (\log n)^{d-1})\)
    • Construction time: \(O(n (\log n)^d)\)
    • Range search time: \(O(s + (\log n)^d)\)

(Note: \(d\) is considered to be a constant.)

  • Space/Time trade-off compared to kd-trees.


Conclusion

Range search data structures summary

  • Quadtrees
    • Simple (also for dynamic set of points)
    • Work well only if points are evenly distributed
    • Wastes space for higher dimensions
  • kd-trees
    • Linear space
    • Range search time \(O(\sqrt{n} + s)\)
    • Inserts/deletes destroy the balance and hence the range search run-time (no simple fix)
  • Range-trees
    • Range search time \(O(\log^2 n + s)\)
    • Wastes some space
    • Inserts/deletes destroy balance (can fix this with occasional rebuild)
