Module 4: Advanced SQL¶

General Integrity Constraints and Views¶

Syntax:

CREATE ASSERTION <assertion-name>
CHECK (<condition>)

Views in SQL¶

Syntax:

CREATE VIEW <view-name> [AS] (<query>)

A view has many of the same properties as a table.

Its definition appears in the database schema;
Access controls can be applied to it;
Other views can be defined in terms of it; and
It can be queried as if it were a table.

Unlike tables, only some views are updatable.

Updating Views¶

Modifications to a view’s instance must be propagated to tables and other views referenced by the view query.

View Updates in SQL¶

According to SQL-92, a view is updatable only if its definition satisfies a variety of conditions:

The query references exactly one table;
The query only outputs simple attributes (no expressions);
There is no grouping/aggregation/DISTINCT;
There are no nested queries; and
There are no set operations.

The rules ensure that updates to underlying tables and views are implicitly defined by their old contents and by updates to the view itself. Determining implicit definability in general is undecidable.

On Multiset Semantics¶

SQL has always had a more general multiset or bag semantics that allows duplicates, in contrast to the simpler set semantics of RC:

Note

A multiset (also called a bag) is a generalization of a set in which members are allowed to appear more than once. Unlike a regular set, which enforces uniqueness, a multiset allows duplicate elements but may track their multiplicity (how many times each element occurs).

SQL tables, views and query results are multisets of tuples
originally adopted for efficiency reasons

A multiset semantics is assumed for existential quantification (\(\exists\)) in the RC query.

An additional kind of condition “elim \(\varphi\)” removes duplicates.

Definition (Range Restricted RC with Duplicate Elimination)

Given a database signature \(\rho = (R_1/k_1, \ldots, R_n/k_n)\), a set of variable names \(\{x_1, x_2, \ldots\}\) and a set of constants \(\{c_1, c_2, \ldots\}\), range restricted conditions are formulas defined by the grammar:

\[ \begin{aligned} \varphi ::= &\ R_i(x_{i,1}, \ldots, x_{i,k_i}) \\ &\mid \varphi_1 \wedge (x_i = x_j) \quad \text{where } \{x_i, x_j\} \cap Fv(\varphi_1) \neq \emptyset \\ &\mid x_i = c_j \\ &\mid \varphi_1 \wedge \varphi_2 \\ &\mid \exists x_i.\varphi_1 \\ &\mid \varphi_1 \vee \varphi_2 \quad \text{where } Fv(\varphi_1) = Fv(\varphi_2) \\ &\mid \varphi_1 \wedge \neg \varphi_2 \quad \text{where } Fv(\varphi_1) = Fv(\varphi_2) \\ &\mid \text{elim } \varphi \end{aligned} \]

A range restricted RC query has the form \(\{(x_1, \ldots, x_n) \mid \varphi\}\), where \(\{x_1, \ldots, x_n\} = Fv(\varphi)\) and where \(\varphi\) is a range restricted condition.

Note: No direct access to the repetition counts is possible.

Semantics (Multiset Semantics for Range Restricted RC)

A formula \(\varphi\) over a signature \(\rho = (R_1/k_1, \ldots, R_n/k_n)\) is true \(m\) times respect to database instance \(\textbf{DB} = (\textbf{D}, \approx, \textbf{R}_1, \ldots, \textbf{R}_n)\) and valuation \(\theta : \{x_1, x_2, \ldots\} \rightarrow \textbf{D}\), written \(\textbf{DB}, \theta, m \models \varphi\), as given by the following:

\[ \begin{aligned} &\textbf{DB}, \theta, 0 \models \textbf{R}_i(x_{i,1}, \ldots, x_{i,k_i}) && \text{if } (\theta(x_{i,1}), \ldots, \theta(x_{i,k_i}), m) \not\in \textbf{R}, \text{ for any } m > 0 \\ &\textbf{DB}, \theta, m \models \textbf{R}_i(x_{i,1}, \ldots, x_{i,k_i}) && \text{if } (\theta(x_{i,1}), \ldots, \theta(x_{i,k_i}), m) \in \textbf{R} \\ &\textbf{DB}, \theta, m \models \varphi \wedge (x_i = x_j) && \text{if } \textbf{DB}, \theta, m \models \varphi \text{ and } \theta(x_i) \approx \theta(x_j) \\ &\textbf{DB}, \theta, 1 \models (x_i = c_j) && \text{if } \theta(x_i) \approx c_j \\ &\textbf{DB}, \theta, m_1 \cdot m_2 \models \varphi \wedge \psi && \text{if } \textbf{DB}, \theta, m_1 \models \varphi \text{ and } \textbf{DB}, \theta, m_2 \models \psi \\ &\textbf{DB}, \theta, \sum_{v \in \textbf{D}} m_v \models \exists x_i. \varphi && \text{if } \textbf{DB}, \theta[x := v], m_v \models \varphi \\ &\textbf{DB}, \theta, m_1 + m_2 \models \varphi \vee \psi && \text{if } \textbf{DB}, \theta, m_1 \models \varphi \text{ and } \textbf{DB}, \theta, m_2 \models \psi \\ &\textbf{DB}, \theta, \max(0, m_1 - m_2) \models \varphi \wedge \neg \psi && \text{if } \textbf{DB}, \theta, m_1 \models \varphi \text{ and } \textbf{DB}, \theta, m_2 \models \psi \\ &\textbf{DB}, \theta, 1 \models \text{elim } \varphi && \text{if } \textbf{DB}, \theta, m \models \varphi, \text{ where } m > 0 \end{aligned} \]

The answers to a query \(\{(x_1, \ldots, x_n) \mid \varphi\}\) over \(\textbf{DB}\) is given as follows, when \(\{x_1, \ldots, x_n\} = Fv(\varphi)\):

\[ \{(\theta(x_1), \ldots, \theta(x_n), m) \mid m > 0 \wedge \textbf{DB}, \theta, m \models \varphi\}. \]

The Basic “SELECT Block” Syntax:

SELECT [DISTINCT]
FROM
WHERE

Allows formulation of conjunctive \((\exists, \land)\) range restricted RC queries with multiset semantics of the form:

\[ \{\texttt{<results>} \mid \texttt{[elim] }\exists\texttt{<unused>}.(\bigwedge \texttt{<tables>}) \land \texttt{<condition>}\}. \]

a conjunction of <tables> with <condition>
<results> specifies values in the resulting tuples
<unused> are variables not used in <results>
duplicates eliminated from result if DISTINCT included

Multiset Operations¶

Answers to SELECT blocks are multisets of tuples.

can apply multiset operations on them
Multiset union: \(Q_1 \texttt{ UNION ALL } Q_2\).
- behaves like "\(\vee\)" in range restricted RC with multiset semantics
Multiset difference: \(Q_1 \texttt{ EXCEPT ALL } Q_2\).
- behaves like "\(\wedge\)" in range restricted RC with multiset semantics
Multiset intersection: \(Q_1 \texttt{ INTERSECT ALL } Q_2\).
- maximum number of tuples common to \(Q_1\) and \(Q_2\)
- rarely used

Note

Operation	Duplicates Considered?	Key Rule
`UNION ALL`	Yes	Combines all rows from both queries, including duplicates.
`INTERSECT ALL`	Yes	Returns only rows common to both queries, with multiplicities equal to the minimum in both.
`EXCEPT ALL`	Yes	Returns rows in the first query that are not in the second, subtracting duplicates.

\(Q_1\) and \(Q_2\) must again have union-compatible signatures (same number and types of attributes).

On subqueries with duplicates:

Such subqueries occurring in WHERE clauses will not change the results computed by the top-level query.
Such subqueries occurring in WITH or FROM clauses can change the results computed by the top-level query.

Null Values¶

What is a “null” value?

Represents the absence of a value
Value inapplicable
- Essentially poor schema design.
Value unknown
Note

Unknown values can be replaced by any domain value (that satisfies integrity constraints).
- many possibilities (possible worlds)

How do we answer queries?

Answers true in all possible worlds \(\mathbf{W}\) of an incomplete instance \(\mathbf{D}\).

Certain Answer:

\[ Q(\mathbf{D}) = \bigcap_{\mathbf{W} \text{ world of } \mathbf{D}} Q(\mathbf{W}) \]

Answer common to all possible worlds.

Note

❗

Concept Explanation from ChatGPT:

Here's a breakdown:

Possible Worlds \((\textbf{W})\): In the context of incomplete or uncertain data, each "world" represents one possible complete version of the data. Each possible world \(\textbf{W}\) is an interpretation of the data instance \(\textbf{D}\) where missing or uncertain values are filled in different ways.
Certain Answer: A certain answer to a query \(Q(\textbf{D})\) on an incomplete instance \(\textbf{D}\) is an answer that is true in all possible worlds. This is computed by taking the intersection of answers across all worlds:

\[ Q(\textbf{D}) = \bigcap_{\textbf{W} \text{ world of } \textbf{D}} Q(\textbf{W}) \]

This expression means that the result of \(Q(\textbf{D})\) includes only those answers that would be true no matter how the incomplete data \(\textbf{D}\) is completed.
Feasibility: The slide notes that finding the certain answer is computationally challenging. Determining an answer common to all possible worlds is often NP-hard or undecidable except for trivial cases, as it requires evaluating the query across a potentially infinite number of possible worlds.
SQL's Solution: Since it’s impractical to compute certain answers directly, SQL typically provides an approximate answer. SQL engines do not consider all possible worlds; they usually provide answers based on the data available, ignoring the uncertainty in missing data.

Is this (computationally) feasible?

No; NP-hard to undecidable except in trivial cases.

SQL’s solution?

An approximation.

What can we do with NULLs in SQL?

expressions
- General rule: a NULL as a parameter to an operation makes the result NULL.
- 1 + NULL → NULL, ’foo’||NULL → NULL, etc.
predicates/comparisons
- Three-valued logic. approximation of value unknown
set operations
- Unique special value for duplicates.
aggregate operations
- Do not “count” (i.e., “value inapplicable”).

Note

Comparisons with a NULL value return UNKNOWN

1 = 1 TRUE
1 = NULL UNKNOWN
1 = 2 FALSE

Boolean operations need to defined for UNKNOWN.

extended truth tables for Boolean connectives

UNKNOWN in WHERE Clauses

How is this used in a WHERE clause?

Additional syntax IS TRUE, IS FALSE, and IS UNKNOWN.
- WHERE <cond> is shorthand for WHERE <cond> IS TRUE
Special comparison IS NULL

DB2 uses - to denote a NULL values.

Counting NULLs

COUNT(URL) counts only non-NULL URL’s.

COUNT(*) counts “rows”

Outer Join

Note

Allow “NULL-padded” answers that “fail to satisfy” a conjunct in a conjunction.

Extension of syntax for the FROM clause:

\[ \texttt{FROM } T1 \texttt{ <j-type> } \texttt{JOIN } T2 \texttt{ ON } \texttt{<cond>} \]

<j-type> is one of FULL, LEFT, RIGHT, or INNER

Semantics, for \(T_1 = R(x, y)\), \(T_2 = S(y, z)\), and \(\langle \text{cond} \rangle = (r.y = s.y)\):

\(\{(x, y, z) \mid (R(x, y) \land S(y, z))\}\) (INNER JOIN)
\(\vee ((z = \text{NULL}) \land R(x, y) \land \neg (\exists z. S(y, z))) \quad \text{(for LEFT and FULL)}\)
Note

❗

Concept Explanation from ChatGPT:

Intuition:

For a LEFT OUTER JOIN, every row in the left-hand table \(R\) must appear in the result:
- If a row in \(R(x, y)\) does not find a match in \(S(y, z)\) (i.e., no \(z\) exists for \(y\) in \(S\), NULL is placed in \(z\) for that unmatched row.
\(\vee ((x = \text{NULL}) \land S(y, z) \land \neg (\exists x. R(x, y))) \quad \text{(for RIGHT and FULL)}\)

Example:

select aid, publication
from author left join wrote on aid=author

-- Author ids and a count of the number of publications for each.
select aid, count(publication) as pcnt
from author left join wrote on aid = author
group by aid

Ordering and Limits¶

Ordering Results in SQL¶

No particular ordering on the rows of a table can be assumed when queries are written.
No particular ordering of rows of an intermediate result in the query can be assumed either.
However, it is possible to order the final result of a query with an ORDER BY clause at the end of the query.

Syntax:

\[ \texttt{ORDER BY } e_1[Dir_1],\ldots,e_k[Dir_k] \]

where \(Dir_i\) is either ASC or DESC

ASC is the default

Limiting the Number of Results¶

The number of results of a query can be limited by appending a LIMIT clause to a query.

Syntax, where \(e_i\) **is a numeric expression:

\[ \texttt{<query> LIMIT }e_1 \text{ }[\texttt{OFFSET } e_2] \]

Semantics is non-deterministic:

As many of the first \(e_1\) answers to a query that exist for any total order to which the results of <query> can be extended, if there is no OFFSET given.
Semantics becomes deterministic only when <query> has an ORDER BY clause inducing a total order. As many of the next \(e_1\) answers to a query that exist following the first \(e_2\) answers for any total order to which the results of <query> can be extended, if OFFSET is given.

Semantics becomes deterministic only when <query> has an ORDER BY clause inducing a total order.

Note

❗

Concept Explanation from ChatGPT:

A total order is a mathematical concept used to describe an ordering of elements within a set that satisfies certain properties, ensuring that every pair of elements can be compared to each other. It is also known as a linear order. The concept is extensively used in mathematics, computer science, and related fields, particularly when discussing sorting algorithms, database query results, and data structures.

Application in Computing and Databases

In the context of computing and databases, especially when dealing with sorting or organizing data:

Sorting Algorithms: Total order is essential for defining the sequence in which elements should appear in a sorted list. For instance, sorting a list of numbers from smallest to largest.
Database Queries: In SQL, the ORDER BY clause establishes a total order on the rows returned by a query based on specified columns. This is crucial for ensuring that the results are presented in a predictable and repeatable manner, especially when combined with LIMIT and OFFSET clauses for pagination purposes.

Syntax

LIMIT e1: This clause is used to restrict the number of rows returned by a query. e1 here is a numeric expression that specifies the maximum number of rows to return.
OFFSET e2: Optionally, you can specify an OFFSET e2, where e2 is also a numeric expression indicating the number of rows to skip before starting to return rows.

Semantics

Without OFFSET:
- If only LIMIT is used (LIMIT e1), the database returns up to e1 rows based on the internal order of rows as they are stored or processed. This order is not specified by the user and can vary depending on several factors like the database's internal architecture, indexing strategy, or how the data is physically stored.
With OFFSET:
- When OFFSET is used together with LIMIT (LIMIT e1 OFFSET e2), the database skips the first e2 rows and then returns the next e1 rows. Like the scenario without OFFSET, if no specific ordering is specified, which rows are skipped and returned can vary.

Deterministic vs. Non-Deterministic Results

Non-Deterministic: The results are considered non-deterministic when no ORDER BY clause is used. This means that the database does not have a specific order to follow while retrieving or displaying the results. Therefore, the actual rows returned by a LIMIT or OFFSET clause might differ between executions of the same query, especially if the database undergoes changes between these executions (like updates, inserts, or deletions).
Deterministic: The results become deterministic when an ORDER BY clause is included. ORDER BY forces the database to sort the results according to the specified columns before applying the LIMIT or OFFSET. This sorting ensures a consistent order, so every execution of the same query will yield the same results, assuming no changes to the underlying data.

Triggers and Authorization¶

Triggers in SQL¶

Syntax (for individual tuple events, and simplified):

CREATE TRIGGER <trigger-name>
AFTER <event> ON <table-or-view>
     [ REFERENCING OLD AS <corelation-old> ]
     [ REFERENCING NEW AS <corelation-new> ]
     FOR EACH ROW [ WHEN <condition> ] BEGIN ATOMIC
          <DML-action-1>;
          ...
          <DML-action-n>;
END

where <event> can be one of:

INSERT,
DELETE, or
UPDATE OF <attribute>

Note

❗

Concept Explanation from ChatGPT:

In SQL, the keyword ATOMIC in a trigger (or other SQL statements) refers to the atomicity property, which ensures that the entire block of statements inside the BEGIN ATOMIC ... END structure is treated as a single, indivisible unit of work. This means that either all statements within the block are successfully executed, or none of them are. If any statement within the atomic block fails, the entire block will be rolled back, leaving the database in a consistent state as if none of the statements in the block were executed.

Example:

create trigger delete_author \
after delete on author \
referencing old as a \
for each row begin atomic \
   delete from wrote where wrote.author = a.aid; \
end

Semantics (overview):

Implements event/condition/action (ECA) rules.
Adds additional operations to the ongoing transaction issuing the operation that triggered the execution of the rule.
Timing of integrity constraint checking must be carefully managed.
- deferring all to COMMIT time always works, but impacts performance
Makes the SQL DML Turing complete.

ECA Rules in Foreign Keys

Special syntax exists for common rules related to foreign key constraints:

FOREIGN KEY ( <from-attribute-list> )
REFERENCES <table> [ ( <to-attribute-list> > ) ]
     [ ON DELETE <action> ]
     [ ON UPDATE <action> ]

where <action> can be:

RESTRICT (produce an error),
CASCADE (propagate the delete), or
SET NULL (set to “unknown”).

Example:

create table WROTE (
     author integer not null,
     publication integer not null,
     primary key (author, publication),
     foreign key (author) references AUTHOR
          on delete cascade,
     foreign key (publication) references PUBLICATION )

Concept Explanation from ChatGPT:

ON DELETE CASCADE: This is a referential action related to the foreign key. The CASCADE keyword dictates that when a referenced record in the parent table (AUTHOR) is deleted, then all the records in the child table that have a foreign key matching that primary key value should also be deleted automatically.

Authorization in SQL¶

The SQL DML includes a data control language (DCL) to manage access rights to database objects by users and groups of users.

Syntax (simplified):

GRANT  <role> TO <user>
GRANT  <what> ON <object> TO <user-or-role>
REVOKE <role> FROM <user>
REVOKE <what> ON <object> FROM <user-or-role>

where “<what> ON <object>” can be:

DATABASE -- object
    CONNECT -- what
TABLE or VIEW -- object
    ALTER, REFERENCES, SELECT, INSERT, DELETE, or UPDATE -- what

... and where CREATE ROLE <role> commands create new roles.

role \(\texttt{DBADM}\) is a superuser (short for Database Administrator)

Example:

-- Create the PAT role, and grant ability to access the PAYROLL database to PAT.
create role PAT
grant connect on PAYROLL to PAT
-- Grant ability to query table EMPLOYEE to the payroll project team.
grant select on EMPLOYEE to PAT
-- Add Jim to the payroll project team.
grant PAT to Jim