
Module 12: Database Tuning

Overview and Workload Modelling

Database Tuning

Database tuning primarily entails adjusting database parameters and physical database design to address performance issues with transaction response time and throughput.

Choice of technology can also make it necessary to modify conceptual database design and application DML code to address such issues.

Note

Concept Explanation from ChatGPT:

Throughput refers to the number of transactions or queries that a database system can process and complete in a given time period, typically measured in transactions per second (TPS) or queries per second (QPS).

An appreciation of application workload and resource bottlenecks is crucial to choosing parameters and to physical design.

Workload Description

A workload description contains:

  • the critical queries and their frequency,
  • the critical updates and their frequency, and
  • the desired performance goal for each query or update.

For each query:

  • which relations are accessed,
  • which attributes are retrieved, and
  • which attributes occur in selection/join conditions, and how selective those conditions are.
    • A selection condition compares an attribute with a constant; a join condition compares attributes of different relations.

For each update:

  • the type of update and relations/attributes affected, and
  • which attributes occur in selection/join conditions, and how selective those conditions are.

Performance goals:

  • reducing the time needed to compute answers to a query, or
  • increasing transaction throughput for transactions entailing updates.
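
For illustration, a small workload description for the bibliography database used later in this module might look like the sketch below (the frequencies and performance goals are invented for the example):

-- Q1: runs ~1000 times/day, goal: answer in under 0.1 s.
--   Accesses AUTHOR and WROTE; retrieves name;
--   join condition aid = author (selective: few WROTE rows per author).
SELECT name FROM AUTHOR, WROTE WHERE aid = author;

-- U1: runs ~500 times/day, goal: sustain 20 TPS during bulk loads.
--   Inserts one WROTE row; no selection conditions.
INSERT INTO WROTE (author, publication) VALUES (?, ?);  -- ? = host-language parameters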

Physical Database Design

Standard Physical Design

A standard physical design for a relational schema defines the following for each relation name \(R\):

  1. A primary index for \(R\) materializing its extension as a concrete data structure.
  2. Zero or more secondary indices for \(R\) materializing projections of the primary index for \(R\) as concrete data structures.

A materialization of a relation adds an additional record identifier (RID) attribute to the relation.

Secondary indices usually include the RID attribute of the primary index in their projection of \(R\).

Primary and secondary indices are a variety of materialized views.

Note

Concept Explanation from ChatGPT: Why do we use Secondary Indices?

  1. Faster queries: Secondary indexes speed up searches on non-primary key columns by avoiding full table scans.
  2. Flexibility: They enable efficient access to data based on non-key attributes.
  3. Optimized filters: Queries with WHERE conditions on indexed columns execute faster.
  4. Better performance for joins: Secondary indexes improve join efficiency when the indexed columns are used.
  5. Support for multiple access patterns: They allow retrieval using different columns, enhancing query flexibility.

Materialized Views

Note

Concept Explanation from ChatGPT: Materialized View

A materialized view is a database object that stores the precomputed result of a query physically on disk. Unlike a regular view, it does not recompute the query on access but requires periodic or manual refreshes to stay updated.


Key Features of Standard Physical Design

  1. Primary Index:
    • A concrete data structure that materializes the full table (relation) with a primary key.
  2. Secondary Index:
    • Additional indices that materialize projections (subsets of columns) from the primary index for optimized access.
    • They typically include the record identifier (RID) to point to rows in the primary index.
  3. Materialization of Relations:
    • Adds an implicit RID attribute for each record to track its location.
  4. Relation to Materialized Views:
    • Primary and secondary indices are specialized forms of materialized views, as they store precomputed data for faster retrieval.

In brief: Materialized views store precomputed query results, and standard physical design organizes data using primary and secondary indices, both of which act as materialized views with added RIDs for efficient access.

Possible syntax in SQL:

CREATE VIEW <view-name> [AS] (<query>)
MATERIALIZED [AS <data-structure-choice>]

Declaring views to be materialized is not part of the SQL standard. Most RDBMSs also support an alternative CREATE INDEX ... syntax.

Example

Relation name

\[ \texttt{PROF(pnum, lname, dept)} \]

with the standard physical design consisting of:

  1. the Btree primary index on pnum called PROF-PRIMARY

     create view PROF-PRIMARY as (select * from PROF)
     materialized as BTREE with search key (pnum)

  2. and a Btree secondary index on lname called PROF-SECONDARY

     create view PROF-SECONDARY
     as (select lname, rid from PROF-PRIMARY)
     materialized as BTREE with search key (lname)
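
Since the MATERIALIZED view notation is not standard SQL, the same physical design would more commonly be declared with CREATE INDEX; a rough sketch in generic SQL (the exact clauses controlling the data structure and clustering vary by system):

CREATE TABLE PROF (
  pnum  INTEGER PRIMARY KEY,   -- most systems build the primary index on the key
  lname VARCHAR(40),
  dept  VARCHAR(20)
);

-- Secondary B-tree index on lname, analogous to PROF-SECONDARY above.
CREATE INDEX PROF_SECONDARY ON PROF (lname);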

On Choice of Data Structure

A Btree data structure is just one of a number of possibilities an RDBMS might allow for <data-structure-choice>:

  • ISAM or VSAM file (static ordered indices),
  • an unsorted heap file,
  • a hash file,
  • R trees (and other multidimensional data structures),
  • an array (for memory-resident indices).

Workloads that favour each of these exist and are common:

  • navigational applications and R trees
  • OLAP workloads, main memory arrays, and heap files

Records encoding tuples in an index can also be co-clustered with records encoding tuples for another index.

On Co-Clustering Indices

Co-Clustering

Two indices are co-clustered if the data pages for the indices are in common, that is, if records encoding tuples in the indices are interleaved within the same data pages.

Note

Concept Explanation from ChatGPT:

Co-clustering happens when two indices share the same data pages. This means that the data for both indices is stored together in the same physical pages on disk. As a result, the records related to both indices are mixed or interleaved within these shared pages.

Note

Concept Explanation from ChatGPT:

Co-clustering:

  • Normally, indices are stored separately, and the database fetches data for queries by accessing individual indices.
  • With co-clustering, the physical storage layout ensures that records related to the same tuple from multiple indices are stored near each other on disk.
  • This reduces I/O costs because accessing one index can bring related data from another index into memory.

Co-clustering is useful when \((1, N)\) relationships exist between the records for the two indices.

Example: If a foreign key exists from an EMPLOYEE table to a DEPARTMENT table, each record for the primary index of EMPLOYEE might be co-clustered with its related DEPARTMENT record in the data pages of the DEPARTMENT primary index. Records for indices on other tables, e.g., PROJECTS and JOBS can in turn be respectively co-clustered with DEPARTMENT and EMPLOYEE.
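
As one concrete, vendor-specific realization, Oracle exposes co-clustering through table clusters; the sketch below (with minimal DEPT/EMP tables invented for the example) stores DEPT and EMP rows that share a deptno value in the same data blocks:

CREATE CLUSTER emp_dept (deptno NUMBER(3));      -- shared storage area keyed on deptno
CREATE INDEX emp_dept_idx ON CLUSTER emp_dept;   -- cluster index required before inserts

CREATE TABLE dept (
  deptno NUMBER(3) PRIMARY KEY,
  dname  VARCHAR2(30)
) CLUSTER emp_dept (deptno);

CREATE TABLE emp (
  empno  NUMBER(5) PRIMARY KEY,
  ename  VARCHAR2(30),
  deptno NUMBER(3) REFERENCES dept
) CLUSTER emp_dept (deptno);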

Performance implications:

  • Can speed up joins, particularly foreign-key joins; and
  • Sequential scans of either relation become slower.

Note

Concept Explanation from ChatGPT:

Sequential scans of either relation become slower when co-clustering is implemented because co-clustering interleaves records of multiple indices or relations within the same physical data pages. This optimization improves access patterns for related queries, but it can degrade performance for pure sequential scans. Here's why:


How Co-Clustering Affects Sequential Scans

  1. Interleaved Storage:
    • In co-clustering, records from different relations (or indices) are stored together on the same data pages.
    • For example, if EMPLOYEE and DEPARTMENT records are co-clustered:
      • A single data page might contain some EMPLOYEE records and some DEPARTMENT records.
    • This layout is optimized for queries that access related records across the two relations.
  2. Sequential Scans Expect Contiguous Data:
    • A sequential scan involves reading all records of a relation by scanning its data pages sequentially.
    • If records of a relation (e.g., EMPLOYEE) are scattered across data pages due to co-clustering, the database has to skip over unrelated records (e.g., DEPARTMENT records), leading to inefficiencies.
  3. Increased I/O Overhead:
    • Sequential scans now require additional effort to filter out unrelated records during the scan.
    • The interleaved records also reduce the spatial locality of data, which means the database has to read more data pages to access all records of a single relation.

Example

Without Co-Clustering:

  • EMPLOYEE records are stored contiguously across pages:

    Page 1: EMPLOYEE 1, EMPLOYEE 2, EMPLOYEE 3
    Page 2: EMPLOYEE 4, EMPLOYEE 5, EMPLOYEE 6
    
  • A sequential scan of EMPLOYEE reads all records efficiently in one pass through its data pages.

With Co-Clustering:

  • EMPLOYEE and DEPARTMENT records are interleaved:

    Page 1: EMPLOYEE 1, DEPARTMENT 1, EMPLOYEE 2
    Page 2: DEPARTMENT 2, EMPLOYEE 3, DEPARTMENT 3
    
  • A sequential scan of EMPLOYEE now has to:

    • Read both pages.
    • Skip over the DEPARTMENT records, increasing processing time and I/O.

When This Trade-Off Occurs

  • Co-clustering benefits:
    • Queries that access related records across relations, such as joins, benefit significantly.
    • Example: SELECT * FROM EMPLOYEE JOIN DEPARTMENT ON EMPLOYEE.DeptID = DEPARTMENT.DeptID.
  • Sequential scans are slower:
    • Queries that scan all records from a single relation without considering the related relation.
    • Example: SELECT * FROM EMPLOYEE.

How to Mitigate This Trade-Off

  1. Choose Co-Clustering Wisely:
    • Use co-clustering only for relations with frequent joins or queries involving related indices.
    • Avoid co-clustering for tables frequently accessed individually through sequential scans.
  2. Partitioning:
    • Partition tables or indices in a way that minimizes unnecessary interleaving.
  3. Parallel Processing:
    • Sequential scans can be parallelized across multiple processors to reduce the impact of interleaving.

Conclusion

Sequential scans become slower with co-clustering because interleaving records reduces the spatial locality of data for a single relation. While co-clustering is beneficial for queries involving related data across indices or relations, it introduces overhead for queries that sequentially scan one relation. The trade-off between these two performance goals must be carefully considered during database design.

On Search in Ordered Indices

Ordered indices allow more general subqueries to be evaluated efficiently.

  • range queries

    • B-trees can also help evaluate a query of the form

      select * from <table>
      where <attribute> >= <constant>

      If the search key of a Btree is on <attribute>, it can be used to efficiently locate the tuples where <attribute> = <constant>. The remaining qualifying records are then obtained by scanning the subsequent data pages.

  • multi-attribute search keys

    • It is usually possible to create an index on several attributes of the same relation.

Note that the order in which attributes appear in the search key is important.
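
For instance, a minimal sketch using the PROF relation from earlier (the choice and order of attributes here are just for illustration):

-- Composite search key (dept, lname) helps queries such as
--   WHERE dept = 'CS'                        -- matches a prefix of the key
--   WHERE dept = 'CS' AND lname = 'Smith'    -- matches the full key
-- but is of little use for
--   WHERE lname = 'Smith'                    -- does not match the key prefix
CREATE INDEX PROF_DEPT_LNAME ON PROF (dept, lname);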

Beyond Standard Physical Design

  • join indices

    • In standard physical design, a secondary index is a materialized view on a primary index.
    • Some systems support a materialized view defined by a conjunctive query on more than one primary index.

    Example: A join index for the bibliography database:

    create view AUTHOR-BOOK as (
    select author-rid, book-rid
    from AUTHOR-PRIMARY, WROTE-PRIMARY, BOOK-PRIMARY
    where aid = author and publication = pubid )
    materialized as heap file
    
  • arbitrary materialized views

    • Some systems support a materialized view defined by an arbitrary SQL query.
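
PostgreSQL is one such system; a minimal sketch of its (non-standard) syntax, reusing the bibliography tables from the join-index example:

-- Precompute a per-author publication count; must be refreshed explicitly.
CREATE MATERIALIZED VIEW AUTHOR_PUB_COUNT AS
  SELECT aid, name, COUNT(*) AS num_pubs
  FROM AUTHOR, WROTE
  WHERE aid = author
  GROUP BY aid, name;

-- Re-run the defining query and replace the stored result.
REFRESH MATERIALIZED VIEW AUTHOR_PUB_COUNT;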

Query Optimization Revisited

View Based Query Rewriting

Given a query \(Q\) and a physical database design consisting of a set of arbitrary materialized views \(\{V_1, \dots, V_n\}\), the view-based query rewriting problem is to find the most efficient query plan \(P\) equivalent to \(Q\) that uses only views \(V_i\).

View-based query rewriting is also a fundamental problem in data integration, where data sources are defined via the local-as-view paradigm.

Theorem

View-based query rewriting is undecidable, even when \(Q\) and each \(V_i\) is a conjunctive query, and any plan \(P\) equivalent to \(Q\) is acceptable.

Note

Concept Explanation from ChatGPT:

View-Based Query Rewriting is the process of rewriting a query \(Q\) using a set of materialized views \(\{V_1, V_2, \dots, V_n\}\) instead of accessing the base tables directly. The goal is to find a new query plan \(P\) that is equivalent to \(Q\) (produces the same results) but uses only the materialized views \(V_i\), potentially improving performance by leveraging precomputed data.

Materialized views are precomputed query results stored in the database. They are often used to optimize query execution by avoiding the need to recompute results from the base tables, especially for expensive operations like joins or aggregations.

For example, if \(Q\) is a query to retrieve information about employees in a department, and \(V_1\) is a materialized view of all employee-department relationships, the rewritten query \(P\) could retrieve the data directly from \(V_1\), bypassing the base tables.

This concept is particularly important in data integration systems, where multiple data sources are represented as views (local-as-view paradigm). In these cases, the query rewriting process enables accessing integrated data efficiently by using these views.

The challenge lies in finding the most efficient query plan \(P\), as the search space of possible rewritings can be large. The theorem states that view-based query rewriting is undecidable in general, meaning there is no algorithm that can always determine the correct rewritten query \(P\), even when \(Q\) and \(V_i\) are simple conjunctive queries. This highlights the computational complexity of the problem.

Query Plan Tools

An RDBMS will usually have a tool to enable one to investigate what plan is chosen for a query and what the estimated cost is.

For DB2, one can use db2expln and dynexpln.

Example: Invoking these tools on the bibliography query:

SELECT name
FROM author, wrote
WHERE aid = author;

generates the following:

[db2expln output not reproduced here]
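
Other systems expose the same information through an EXPLAIN statement rather than separate command-line tools; a minimal sketch in PostgreSQL-style syntax, given only as a point of comparison with the DB2 tools:

-- Show the optimizer's chosen plan and its cost estimates without running the query.
EXPLAIN
SELECT name
FROM author, wrote
WHERE aid = author;

-- EXPLAIN ANALYZE additionally executes the query and reports actual row counts and times.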

Index Advising Tools

An RDBMS will usually also have tools to advise on a standard physical design given a workload description.

For DB2, one can use db2advis.

db2advis -d cs338 -s "select name from author,wrote where aid=author"

The above shell command outputs:

Calculating initial cost (without recommended indexes) [25.390385]
Initial set of proposed indexes is ready.
Found maximum set of [2] recommended indexes
Cost of workload with all indexes included [0.364030] timerons
total disk space needed for initial set [ 0.014] MB
total disk space constrained to [ -1.000] MB
2 indexes in current solution
[ 25.3904] timerons (without indexes)
[  0.3640] timerons (with current solution)
[%98.57] improvement

Trying variations of the solution set.
--
-- execution finished at timestamp 2006-11-23-12.25.24.205770
--
-- LIST OF RECOMMENDED INDEXES
-- ===========================
-- index[1],  0.009MB
CREATE INDEX WIZ8 ON "DAVID"."AUTHOR" ("AID" ASC, "NAME" ASC);
-- index[2],  0.005MB
CREATE INDEX AW ON "DAVID"."WROTE" ("AUTHOR" ASC);
--
Index Advisor tool is finished.

NOTE: db2advis can be given a more general workload.

Physical Design Guidelines

  • Don’t index unless the performance increase outweighs the update overhead.
  • Attributes mentioned in WHERE clauses are candidates for index search keys.
  • Multi-attribute search keys should be considered when:
    1. A WHERE clause contains several conditions; or
    2. It enables index-only plans.
  • Choose indexes that benefit as many queries as possible.
  • Each relation can have at most one primary index; therefore, choose it wisely.
    1. Target important queries that would benefit the most:
      • Range queries benefit the most from clustering.
      • Join queries benefit the most from co-clustering.
    2. A multi-attribute index that enables an index-only plan does not benefit from being clustered.
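
As a small illustration of point 2, the pair of indexes recommended by db2advis above already contains every attribute the bibliography query touches, so the query admits an index-only plan and clustering those indexes would add little; restated as plain SQL:

-- Both query attributes (aid, name) live in the first index and the join
-- attribute (author) lives in the second, so the plan below never needs
-- to read the base tables.
CREATE INDEX WIZ8 ON AUTHOR (aid, name);
CREATE INDEX AW   ON WROTE  (author);

SELECT name
FROM AUTHOR, WROTE
WHERE aid = author;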

Logical Schema and Query Tuning

Schema Tuning and Normal Forms

If performance issues persist despite physical database design tuning, it may be necessary to adjust the conceptual database design and tune queries.

Goals:

  • Avoiding expensive operations in query execution (e.g., joins).
  • Retrieving related data in fewer operations.

Techniques:

  • Use alternative normalization or weaker normal forms.
  • Apply co-clustering (if supported) and denormalization.
  • Vertically and horizontally partition data and materialized views.
  • Avoid concurrency hot-spots.

Note

Concept Explanation from ChatGPT:

Hot spots in the context of databases refer to specific data items, rows, or pages that are accessed or modified frequently by multiple concurrent transactions. These hot spots can create performance bottlenecks and increase contention, leading to delays, deadlocks, or reduced throughput.

Tuning the Conceptual Schema

Adjustments can be made to the conceptual schema:

Re-normalization

  • Consider alternative BCNF decompositions that better fit the queries in the workload.
  • Effects of (re) normalization:
    • Speeds up simple updates (fewer change anomalies).
    • Speeds up simple queries (smaller tables).
    • Slows down complex queries (more joins).
    • Slows down complex updates (if they involve complex queries).

Denormalization

  • Consider merging relational schemata to intentionally increase redundancy.
  • Effects of redundancy:
    • Increases update overhead (due to change anomalies).
    • Decreases query overhead.
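
A minimal sketch of denormalization, with EMPLOYEE/DEPARTMENT-style tables invented for the example: the department name is copied into the employee table so that a frequent lookup no longer needs a join, at the cost of keeping the copies consistent.

-- Normalized design:    EMPLOYEE(empno, ename, deptno), DEPARTMENT(deptno, dname)
-- Denormalized variant: redundantly store dname with every employee.
CREATE TABLE EMPLOYEE_DENORM (
  empno  INTEGER PRIMARY KEY,
  ename  VARCHAR(40),
  deptno INTEGER,
  dname  VARCHAR(40)   -- redundant copy; must be maintained when a department is renamed
);

-- The frequent "employee with department name" query now needs no join:
SELECT ename, dname FROM EMPLOYEE_DENORM WHERE empno = 1234;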

Partitioning

Very large tables can lead to performance bottlenecks. Partitioning involves splitting a table into multiple smaller tables to reduce I/O cost or lock contention.

Horizontal Partitioning

  • Each partition contains all the original columns but only a subset of the original rows.
  • Tuples are assigned to a partition based on specific criteria (e.g., natural divisions like date or region).
  • Often used to separate operational data from archival data.

Vertical Partitioning

  • Each partition contains a subset of the original columns but all the original rows.
  • Typically used to:
    • Separate frequently-used columns (to reduce concurrency issues or lock contention).
    • Isolate infrequently-used columns for efficiency.
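
A minimal sketch of both kinds of partitioning in plain SQL, using a hypothetical ORDERS table (many systems also offer declarative PARTITION BY syntax, not shown here, and the exact CREATE TABLE ... AS SELECT form varies slightly across systems):

-- Horizontal partitioning: same columns, rows split by date
-- (e.g., operational vs. archival data).
CREATE TABLE ORDERS_CURRENT AS SELECT * FROM ORDERS WHERE order_date >= DATE '2024-01-01';
CREATE TABLE ORDERS_ARCHIVE AS SELECT * FROM ORDERS WHERE order_date <  DATE '2024-01-01';

-- Vertical partitioning: same rows, columns split;
-- the key (order_id) is repeated so the partitions can be re-joined.
CREATE TABLE ORDERS_MAIN  AS SELECT order_id, customer_id, order_date, status FROM ORDERS;
CREATE TABLE ORDERS_NOTES AS SELECT order_id, shipping_notes, internal_comments FROM ORDERS;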

Tuning Queries

Changes to physical or conceptual schemas affect all queries and updates in the workload. However, sometimes it is necessary to target the performance of specific queries or applications.

Guidelines for tuning queries:

  • Minimize Sorting Costs:
    • Sorting is expensive. Avoid unnecessary uses of ORDER BY, DISTINCT, or GROUP BY clauses.
  • Replace Subqueries with Joins:
    • Subqueries can often be replaced with joins for better performance (see the sketch after this list).
  • Uncorrelate Subqueries:
    • Replace correlated subqueries with uncorrelated ones to improve efficiency.
  • Leverage Vendor Tools:
    • Use vendor-supplied tools to examine query plans.
    • Update or create statistics if poor plans are caused by incorrect cost estimation.
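
For instance, a correlated subquery over the bibliography tables can be rewritten in join form (a sketch; both queries list the authors who wrote at least one publication):

-- Correlated form: the inner query is re-evaluated for each AUTHOR row.
SELECT aid, name
FROM AUTHOR a
WHERE EXISTS (SELECT 1 FROM WROTE w WHERE w.author = a.aid);

-- Equivalent uncorrelated/join form, which optimizers typically handle better;
-- DISTINCT removes duplicates for authors with more than one publication.
SELECT DISTINCT a.aid, a.name
FROM AUTHOR a, WROTE w
WHERE w.author = a.aid;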

Tuning Applications

Guidelines for tuning applications:

  • Minimize communication costs:
    1. Return only the necessary columns and rows.
    2. Use a single set-based statement with a WHERE clause to update multiple rows, rather than fetching and updating rows one at a time through a cursor (see the sketch after this list).
  • Minimize lock contention and hot-spots:
    1. Delay updates for as long as possible.
    2. Delay operations on hot-spots as much as possible.
    3. Shorten or split transactions when possible.
    4. Perform insertions, updates, and deletions in batches.
    5. Consider using lower isolation levels to reduce contention.
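
A sketch of the set-based update from point 2 of the first list, with an EMPLOYEE table and salary/deptno columns invented for the example; a single statement replaces a loop that fetches and updates one row at a time:

-- Cursor style (pseudocode): open a cursor over
--   SELECT empno FROM EMPLOYEE WHERE deptno = 7
-- and issue one UPDATE per fetched row.
-- Set-based equivalent: one statement, one round trip, one plan.
UPDATE EMPLOYEE
SET salary = salary * 1.05
WHERE deptno = 7;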

Summary

  • Physical database design has a significant impact on performance.
  • Decisions require an understanding of what the DBMS is doing:
    • Regarding query execution.
    • Regarding transaction processing.
    • Regarding query optimization.
  • Modern systems provide many tools to assist.
  • DBMS technology remains an active area of research.