Module 3: Introduction to SQL¶
Relational Schemata and Conjunctive Queries¶
SQL (Structured Query Language)¶
The DDL for defining relational schemata:
- tables and basic integrity constraints;
- general integrity constraints and views;
- event/condition/action (ECA) triggers; and
- authorizations via a data control sublanguage (DCL).
The DML:
- a relationally complete query language;
- a fine grained data revision or update language; and
- protocols for embedding DML requests in application code
create table AUTHOR (
aid integer not null,
name varchar(10) not null,
primary key (aid) )
create table PUBLICATION (
pubid integer not null,
title varchar(25) not null,
primary key (pubid) )
create table WROTE (
author integer not null,
publication integer not null,
primary key (author, publication),
foreign key (author) references AUTHOR,
foreign key (publication) references PUBLICATION )
Note
❗
Concept Explanation from ChatGPT:
Technically, in SQL, if you don't specify the column in the REFERENCES clause, the foreign key will default to the primary key of the referenced table.
Include four very common varieties of integrity constraints:
- data type constraints for each column;
not nullconstraints (strongly desired for each column);primary keyconstraints; andforeign keyconstraints.
Note: SQL is not case sensitive.
SQL Data Types and the Relational Universe¶
The basic data types are as follows:
| Data Type | Description |
|---|---|
integer |
integer (32 bit) |
smallint |
integer (16 bit) |
decimal(m,n) |
fixed decimal |
float |
IEEE float (32 bit) |
char(n) |
character string (length \(n\)) |
varchar(n) |
variable length string (at most \(n\)) |
date |
year/month/day |
time |
hh:mm:ss.ss |
A domain constraint in the form of an SQL data type must be given for each attribute.
SQL Blocks¶
Syntax:
SELECT DISTINCTFROMWHERE
Allows formulation of conjunctive \((\exists, \land)\) RC queries of the form:
- a conjunction of
with specifies values in the resulting tuples are variables not used in
SQL uses corelations (tuple variables) and attributes to assign default variable names to components of tuples. In SQL, R as p (where the keyword as is optional) stands for \(R(p.a_1,...,p.a_k)\) in RC, where \(a_1, \ldots, a_k\) are the attribute names declared for \(R\).
Relations can serve as their own corelations when unambiguous.
select distinct title
from publication, book
where publication.pubid = book.pubid
Here, publication stands for publication (as) publication, i.e., publication(publication.pubid, publication.title).
FROM Clause Summary
Syntax:
- \(R_i\) are relation (table) names.
- \(n_i\) are distinct identifiers.
- The clause represents a conjunction \(R_1 \land \ldots \land R_k\).
- all variables of \(R_i\)’s are distinct
- we use (co)relation names to resolve ambiguities
- Cannot appear alone.
- only as a part of the select block
The result of running the query SELECT * FROM t1, t2 is a Cartesian product (or cross join) of the two tables t1 and t2.
The SELECT Clause
Syntax:
Operate as follows:
-
Eliminate superfluous attributes and remaining duplicates from answers (\(\exists\));
Note
❗
Concept Explanation from ChatGPT:
This step filters out any attributes (columns) that are not required in the final output of the query. It ensures that only relevant data as specified in the select statement is processed. After reducing the data to the necessary attributes, it further eliminates any duplicate rows that may exist, ensuring that each row in the result is unique.
-
Evaluate expressions \(e_i\) (here, built-in functions can be applied to values of attributes); then
Note
❗
Concept Explanation from ChatGPT:
The specified expressions \(e_i\) are evaluated for each row of data. These expressions can include arithmetic operations, string functions, date calculations, or any built-in SQL functions that modify, calculate, or transform data values. This step is crucial for generating the final computed values that you want in the result set.
-
Give names \(n_i\) to expression values in the answer.
Note
❗
Concept Explanation from ChatGPT:
This is essentially the renaming of the output columns in the result set. By using the
ASkeyword, you can assign meaningful names to the results of expressions evaluated in the previous step. This makes the output more readable and understandable, particularly when the expressions are complex or not self-explanatory.
Standard Expressions
We can create values in the answer tuples using built-in functions:
- On numeric types:
+,-,*,/, ... (usual arithmetic)
- On strings:
||(concatenation),substr, ...
- Constants (of appropriate types)
"SELECT 1"is a valid query in SQL-92
- UDF (user-defined functions)
Note: All attribute names must be present in the FROM clause (Concept Explanation from ChatGPT: it refers to an important rule in SQL syntax which ensures that any column or attribute referenced in the SELECT part of a query must be derived from a table specified in the FROM clause.).
Note
📌
A general principle: results of queries should look the same as instances of tables. Thus, query result columns should have attributes as well.
Note
📌
Good programming practice: always ensure an attribute name n is given for each select expression \(e\), appending as n if necessary.
The WHERE clause
Syntax:
Additional conditions on tuples that qualify for the answer.
- Standard atomic conditions:
- equality: =, != (on all types)
- order: <, <=, >, >=, <> (on numeric and string types)
- Conditions may involve expressions.
- as with expressions in the
SELECTclause
- as with expressions in the
Boolean Connectives
Atomic conditions can be combined using boolean connectives:
- AND (conjunction)
- OR (disjunction)
- NOT (negation)
Set Operations and First Order Queries¶
Set Operations¶
Answers to SELECT blocks are relations (sets of tuples).
- can apply set operations on them
Set union: Q1 UNION Q2.
- the set of tuples in Q1 or in Q2
- used to express “or”
Set difference: Q1 EXCEPT Q2.
- the set of tuples in Q1 but not in Q2
- used to express “and not”
Set intersection: Q1 INTERSECT Q2.
- the set of tuples in both Q1 and Q2
- used to express “and” (redundant, rarely used)
Q1 and Q2 must have union-compatible signatures (same number and types of attributes).
Naming Queries and Subqueries¶
Nest set operation inside a select block.
- common table expressions
Assignment
Queries denote relations. SQL provides a naming mechanism to assign names to (results of) queries.
- can be used later in place of (base) relations
Syntax
WITH T1 [<opt-schema-1>] AS ( <query-1-goes-here> ),
...
Tn [<opt-schema-n>] AS ( <query-n-goes-here> )
<query-that-uses-T1-to-Tn-as-table-names>
Note
❗
Concept Explanation from ChatGPT:
When you specify column names in parentheses right after the CTE name, you are essentially naming the columns for the resulting temporary table (the CTE) that the expression will produce. This is not about selecting which columns to show from the original tables, but rather about defining the names of the columns in the output of the CTE.
Specifying Column Names
In the context of your example with the CTE:
WITH bookorjournal (pubid) AS (
(SELECT DISTINCT pubid FROM book)
UNION
(SELECT DISTINCT pubid FROM journal)
)
Here, (pubid) is specifying that the column in the bookorjournal CTE will be named pubid. This is useful because:
- It directly tells anyone looking at the query that
bookorjournalwill contain a column namedpubid. - It ensures that whatever the output of the combined SELECT statements (from
bookandjournal), the resultant column is uniformly known aspubidin subsequent parts of the query.
What It Does Not Do
- It does not filter columns from the source tables: The statement isn't used to decide which columns from the
bookorjournaltables are visible or used beyond the scope of the CTE. -
It does not select columns to display in the final output of the main query: The final output columns are determined by the main SELECT statement outside the CTE. For example, in the main query: The
SELECT DISTINCT titledetermines that the output will show thetitlecolumn from thepublicationtable.SELECT DISTINCT title FROM publication, bookorjournal WHERE publication.pubid = bookorjournal.pubid
Why Use This Naming Convention?
Naming columns in a CTE helps in managing complex queries, particularly when:
- The CTE involves expressions or multiple columns being combined or transformed into a single output column.
- The query involves multiple CTEs or joins where maintaining clear, non-ambiguous column names simplifies understanding and managing the query logic.
So, the specification of column names in your CTE definition is about defining a clear, understandable schema for the data produced by the CTE, which is separate from selecting data to be displayed from your database tables.
SQL-92 permits inlining queries in the FROM clause:
<id>stands for the result of<query-here>- unlike for base relations,
is mandatory
An incorrect SQL query to compute all publication titles for journals or books:
select distinct title
from publication, book, journal
where publication.pubid = book.pubid
or publication.pubid = journal.pubid
// The result will be empty if book is empty, since their cross product is obtained
Nested Queries¶
Overview of WHERE Subqueries
- Presence/absence of a single value in a subquery:
<attr> IN ( <query> )<attr> NOT IN ( <query> )
- Relationship of a value to some/all values in a subquery:
<attr> op SOME ( <query> )<attr> op ALL ( <query> )
- Emptiness/non-emptiness of a subquery:
EXISTS ( <query> )NOT EXISTS ( <query> )
In the first two cases, <query> must be unary.
Nesting in the WHERE clause is mere syntactic sugar:
select r.b
from r
where r.a in (
select b
from s
)
select r.b
from r, (
select distinct b
from s
) as s
where r.a = s.b
Note:
<attr> = some (<query>)” ≡ “<attr> in (<query>)<attr> <> all (<query>)” ≡ “<attr> not in (<query>)
Aggregate Queries¶
Syntax:
SELECT x1,...,xk, agg1 [[AS] n1],..., aggj [[AS] nj]
<FROM-WHERE>
[GROUP BY x1,...,xk]
Restrictions:
- All attributes in the
SELECTclause that are not in the scope of an aggregate function must appear in aGROUP BYclause. aggiare of the formcount(*),count(<expr>),sum(<expr>),min(<expr>),max(<expr>), oravg(<expr>), where<expr>is usually an attribute ofQ(and usually not in theGROUP BYclause).
Operational Reading¶
- Partition the result of
<FROM-WHERE>into the smallest number of groups, with each group having equal values of the grouping attributes, or into a single group if there is noGROUP BYclause. - On each of these partitions, apply the aggregate functions.
- For each group, add a tuple with the grouping attribute values and the results of the aggregate functions to the result.
- (Always a good idea to name the results of the aggregate functions in the SELECT clause.)
The HAVING Clause¶
- The
WHEREclause cannot impose conditions on values of aggregatesWHEREconditions are applied beforeGROUP BY
- SQL introduces a
HAVINGclause to do this.- like
WHERE, but for aggregate values . . .
- like
- The aggregate functions used in the
HAVINGclause may be different from those in theSELECTclause; the grouping, however, is common.
The HAVING clause is just syntactic sugar, and can be replaced by a nested query and a WHERE clause.
Transactions and Database Update¶
Three kinds of table update:
-
The SQL
INSERTcommand, for inserting a single constant tuple or each tuple in the result of a query; -
The SQL
DELETEcommand, for removing all tuples satisfying a condition; and -
The SQL
UPDATEcommand, for updating in place all tuples satisfying a condition
SQL Insert¶
Inserting a single tuple:
- adds tuple \((c_1, . . . , c_k)\) to table \(T\)
- \(c_i\) must be a value in the data type for attribute \(A\)
Inserting multiple tuples given by a query:
- adds each tuple computed by \(Q\) to table \(T\)
SQL Delete¶
Deletion using a condition:
- deletes all tuples that match
.
Deletion using cursors (later).
- available in embedded SQL
-
only way to delete one out of two duplicate tuples
Note
❗
Concept Explanation from ChatGPT:
Why?
- SQL Works on Sets, Not Individual Rows! SQL is inherently set-based, meaning operations like
DELETEact on all rows that match a condition. If you attempt to delete rows with duplicate values without additional mechanisms, the query will delete all rows matching the condition (both duplicates). - Lack of Built-In Mechanism for Tuple Identification. When two rows are exact duplicates, there is no unique identifier (like a
PRIMARY KEYorROWID) to distinguish between them. Without such an identifier, SQL treats both rows as indistinguishable, making it impossible to delete just one without additional logic. - Cursors Provide Row-by-Row Processing. A cursor allows row-by-row traversal of the result set, enabling you to:
- Examine each tuple individually.
- Decide whether to delete a specific tuple while leaving the other intact.
- SQL Works on Sets, Not Individual Rows! SQL is inherently set-based, meaning operations like
SQL Update¶
Two components:
-
SET, an assignment of values to attributes; and
-
WHERE, a search condition.
Syntax:
UPDATE T
SET <assignments>
WHERE <condition>
ACID
The DBMS guarantees noninterference (serializability) of all data manipulation of tables in a database instance within the scope of a transaction.
- Transaction starts with first access of the database until it sees:
COMMIT: make changes permanent,SQL> commitCommit complete.
ROLLBACK: discard changes,SQL> rollbackRollback complete.