CS258 - Database Systems

The relational model

Informally, a relation is a table of values with a set of rows. The data elements in each row represent certain facts that correspond to a real-world entity or relationship.

Each column represents a characteristic of interest of an entity. It has a header that gives an indication of the meaning of the data items in that column.

Keys are used to uniquely identify each row in the table. An auto-incrementing primary key can be called an artificial or surrogate key.

Formal definitions

The relational database is a collection of relations (tables) with known names and attribute pairs (columns). An attribute pair consists of name and data type.

The schema of a relation is denoted by $R(A_1, A_2, ..., A_n)$ where $R$ is the name of the relation and $A_n$ are the attributes of the relation. An example of a relation is CUSTOMER(id, name, address, phone).

Each attribute has a domain, which is the set of valid values. For example, the domain of phone number is integers of length 11.

A tuple (row) of a relation (table) is an ordered set of values. Tuples are written between angled brackets, for example of a tuple in the CUSTOMER relation is <13234, "A. Name", "123 Road, City", 07000123456>.

The relation state is the set of tuples currently in the relation. It is a subset of the cartesian product of the domain of the attributes.

Formal notation

Given a relation $R(A_1, A_2, ..., A_n)$:

  • the schema of the relation is $R(A_1, A_2, ..., A_n)$
  • $R$ is the name of the relation
  • $A_1, A_2, ..., A_n$ are the attributes of the relation
  • $r(R)$ is a specific state of relation $R$. This is the set of tuples in the specific state.
  • $r(R) = \{t_1, t_2, ..., t_n\}$ where $t_i$ is an $m$-tuple
  • $t_i =$ <$v_1, v_2, ..., v_m$> where $v_j$ comes from the domain of $A_j$
  • $r(R) \subset d(A_1) \times d(A_2) \times ... \times d(A_n)$ where $d(A_k)$ is the domain of $A_k$.

Other stuff

A collection of relation schemas is called the relational database schema, also called intension of the database.

The relation states for the corresponding schema are called a relational database or the extension of the database.

More simply, the intension is the structure, the extension is the data.

Constraints

Constraints are used to determine the permissible states of relation instances. Three explicit constraints are used:

  • key constraints
  • entity integrity constraints
  • referential integrity constraints There are also domain constraints. Values in the tuple must come from the domain of that attribute or NULL.

A superkey, $SK$, of a relation $R$ is a subset of the attributes of $R$ such that in any valid state, $r(R)$ for two distinct tuples $t_1, t_2 \in r(R)$, $t_1[SK] \neq t_2[SK]$.

More simply, a superkey of a relation is a subset of attributes which can uniquely identify every tuple in the relation. This could be one attribute or several.

The candidate key of a relation is a superkey with the fewest attributes necessary. Any candidate key can then be used as a primary key.

Entity integrity

The primary key attributes can't be NULL in any tuple of r(R). This is because PK values are used to identify the individual tuples. If the PK has several attributes, null is not allowed as a value of any of those attributes. Any attribute of R may have a not null constraint. This is specified when the table is created.

Referential integrity

Referential integrity means that all references (foreign keys) are valid. The foreign key in one relation must match the primary key in another, or be NULL.

Insert violations

When inserting tuples into the relation, it is possible for the following constraints to be violated:

  • Domain - One of the attribute values for the new tuple is not in the attribute domain
  • Key - the value of a key attribute in the new tuple already exists
  • Referential integrity - a foreign key value in the new tuple references a primary key value that does not exist in the referenced relation
  • Entity integrity - if the primary key value is null in the new tuple

Deletion violations

Deletion can only violate referential integrity. This occurs if the primary key value of the tuple being deleted is referenced from other tuples in other relations. The method of resolving deletions leading to referential integrity violations must be defined for each foreign key when the database is designed.

Update violations

Constraint violations caused by updates are the same as those for insertion. Depending on the attributes being updated, some insert violations may not be applicable to updates.

Functional dependencies

Functional dependencies are constraints that specify the relationship between two sets of attributes, denoted $X \rightarrow Y$. This means $X$ is a set of attributes that are capable of determining $Y$.

Functional dependency example

Employee IDEmployee nameDepartment IDDepartment name
0001Bob1Human resources
0002Steve2Marketing
0003Melissa1Human resources
0004Harvey3Sales

The following functional dependencies can be identified:

  • Employee ID $\rightarrow$ {Employee name, Department ID, Department name}
  • Department ID $\rightarrow$ Department name

SQL

SQL is a declarative language. It is both a data definition language (DDL) for schemas, relations and constraints and a data manipulation language (DML) for updates and queries.

A database contains one or more schemas.

  • Schemas are created using CREATE SCHEMA.
  • Schemas can then be used to make tables.
  • Catalogues are a named collection of schemas.
  • All tables belong to a schema, which is public by default.

Referential integrity triggered actions

ON UPDATE, ON DELETE and CASCADE, SET NULL and SET DEFAULT can be used to decide what to do when referential integrity would be violated by updating or deleting a record.

Querying

SELECT DISTINCT can be used to select distinct records.

Database programming

Java Database Connectivity (JDBC) can be used to interact with a database in Java. The classes can be imported using

import java.sql.*;

Database drivers are loaded using Class.forName. To load the postgres driver, use

Class.forName("org.postgres.Driver");

The DriverManager.getConnection method is used to establish a database connection, returning a Connection.

Connection conn = DriverManager.getConnection("jdbc:postgresql://host:port/database", "username", "password");

The database can then be queried using statements. Basic statements can be created using conn.createStatement and have class Statement.

Statement stmt = conn.createStatement();

These statements can then be used with the execute methods by specifying the SQL to run.

Prepared statements allow for parameterisation of statements. The statement is checked and compiled once and can then be used multiple times with different parameters. This is more efficient than using a Statement.

PreparedStatement is a subclass of Statement and is created using using conn.prepareStatement. The parameters in the statement can be set using setters. The first argument is the parameter number (these are 1-indexed) and the second is the value.

PreparedStatement stmt = conn.prepareStatement("SELECT * FROM table WHERE attr1 = ? AND attr2 = ?");
stmt.setString(1, "attr1 value");
stmt.setInt(2, 42);

The above will result in the query SELECT * FROM table WHERE attr1 = 'attr1 value' AND attr2 = 42.

There are two methods for executing a statement

  • executeQuery for SELECT statements. This returns a ResultSet of values.
  • executeUpdate for UPDATE, DELETE and other data manipulation statements. This returns an the number of modified rows as an integer.
// for Statement
ResultSet result = stmt.executeQuery("SELECT * FROM table");
// or
int resultCount = stmt.executeUpdate("UPDATE table SET a = 2");
// for PreparedStatement
ResultSet result = stmt.executeQuery();
// or
int resultCount = stmt.executeUpdate();

The ResultSet can be iterated using the next() method. The values can be accessed with get methods. These take either the column name or number.

String a;
int b;
while(result.next()){
    a = result.getString(1);
    b = result.getInt("attr2");
}

SQL methods may throw an SQLException which can be caught and handled.

JOIN queries

  • CROSS JOIN - Cartesian product of two tables
  • NATURAL JOIN - Joins tables by attributes with the same name
  • (INNER) JOIN ... ON - Joins two tables on given attributes
  • OUTER JOIN - Like inner join, but rows missing attributes are padded with NULL instead of being ignored. LEFT OUTER JOIN only keeps the rows on the left, RIGHT OUTER JOIN keeps the right and FULL OUTER JOIN keeps both

Correlated subqueries

A correlated subquery is one which refers to its outer query.

EXISTS can be used to check whether a subquery returns anything.

Relational algebra

  • Projection: unary operator. $\pi_A$ defines the attributes of interest. Corresponds to SELECT A in SQL.
  • Selection: unary operator. $\sigma_c$ defines the tuples of interest using constraint $c$. Corresponds to a WHERE clause in SQL.
  • Product: binary operator. $\times$ defines the cartesian product, like CROSS JOIN in SQL.
  • Join: binary operator. $\bowtie$ corresponds to JOIN.
  • Set operators: $\cup$, $\cap$ and $-$.
  • Renaming: $\rho$, used to rename a relation or attribute.

Selection chooses rows, projection chooses columns.

Projections can't contain duplicate values in relational algebra.

If list1 is a proper superset of list2 then $\pi_\text{list 1}\left(\pi_\text{list 2}(R)\right)$ is illegal.

If list2 is a superset of list1 then $\pi_\text{list 2}\left(\pi_\text{list 1}(R)\right) = \pi_\text{list 1}(R)$.

$\rho_S(R)$ renames relation $R$ to $S$. $\rho_{B_1, ..., B_n}(R)$ renames attributes.

Two relations $R(A_1, ..., A_n)$ and $S(B_1, ..., B_n)$ are type compatible if $n = m$ and $dom(A_i) = dom(B_i)$ for $1 \leq i \leq n$. Set operations (union, intersect, difference) can be performed on type compatible relations.

A join between two relations is denoted $R \bowtie_\text{Join condition} S$ and is equivalent to a cross join followed by selection. For example, a table of students and departments could be joined using the cross product and selection, $\sigma_\text{student.depId = dep.id}(\text{student} \times \text{dep})$, or using the join operator $\text{student} \bowtie_\text{student.depId = dep.id} \text{dep}$.

An equijoin is a join with an equality condition, such as the $\text{student.depId = dep.id}$ above.

A natural join joins relations with attributes with the same name. The join condition is then specified implicitly and duplicate columns are omitted. Natural joins are denoted $\star$. For the previous example, $\text{student} \star \rho_{(\text{depId, depName})}(\text{dep})$ would join the relations on the shared attribute $\text{depId}$.

All operations seen so far can be expressed using only $\{\sigma, \pi, \rho, \cup, -, \times\}$.

The division operator ($\div$) is like the inverse of the cross product. $(R \times S) \div S = R$. More formally, let $R(Z)$ and $S(X)$ be relations such that $X \subseteq Z$ and $Y = Z - X$. $R(Z) \div S(X)$ is a relation $T(Y)$ and contains tuples $t$ such that for every $t_S$ in $S$ there exists a $t_R$ in $R$ such that $t$ is the $Y$ attributes from $t_R$ and the attributes from $X$ in $t_R$ equal $t_S$.

Consider a table of student IDs and their modules, $R$, and a table of some subset of student IDs, $S$. $R \div S$ would be the modules which are taken by all of the students in $S$.

Aggregate functions are denoted $_\text{grouping attributes} \frak{J} _\text{functions}$. For example, the maximum price by product ID would be denoted $ _\text{productId} \frak{J} _\text{MAXIMUM price}(\text{prices})$ which is equivalent to SELECT productId, max(price) FROM prices GROUP BY productId;.

Left and right outer joins are denoted with $=\bowtie$ and $\bowtie=$ respectively.

Relational calculus

Relational calculus defines predicates and propositions. Relational calculus is declarative, whereas relational algebra is propositional. Substituting all parameters in a predicate gives a proposition. Intension is the meaning of a predicate. Extension is the set of all instantiations for which the predicate holds.

Let $P(x)$ be a predicate. If an instance of $x$, $a$ is such that $P(a)$ is true then $a$ satisfies $P$. $P(x)$ is the membership predicate for the set of satisfying instances.

A predicate is like a schema and a proposition is like a tuple. An extension is like a relation state.

In addition to defining valid relations, relational calculus can be used to express queries. There are two types, tuple and domain relational calculus.

Tuple relational calculus

Statements are

  • declarative expressions
  • either true or false
  • contain variables which can be instantiated or quantified
  • describe relations among entities
  • can represent tuples

Queries consist of a relation (the range relation), a condition and the desired attributes. A tuple calculus query is a first-order logic formula over tuples. Queries are of the form $\{\text{attributes} \;|\; \text{conditions}\}$ where attributes are like relational algebra projections or SQL SELECT and conditions are either the range relation (SQL FROM), a selection condition (SQL WHERE) or a join condition (SQL JOIN).

$t.A$ denotes attribute $A$ of tuple $t$.

Atomic formulas are either

  • $R(t_i)$ where $R$ is a relation and $t_i$ is a tuple variable
  • $(t_i.A \;\mathbf{op}\; t_j.B)$ where $\mathbf{op} \in \{-, <, \leq, >, \geq, \neq\}$
  • $(t_i.A \;\mathbf{op}\; c)$ or $(c\;\mathbf{op}\;t_i.A)$ where $c$ is a constant

Compound formulas are $\operatorname{NOT}(F)$, $F_1 \operatorname{OR} F_2$, $F_1 \operatorname{AND} F_2$, $(\forall t) F$ and $(\exists t)F$.

For example, the query for names of employees who earn more than 5000 is $\{t.\text{firstName}, t.\text{lastName} \;|\; \operatorname{EMPLOYEE}(t) \operatorname{AND} t.\text{salary} > 5000\}$. In the above example, $\operatorname{EMPLOYEE}$ is the range relation.

Variables can be either free or bound. Quantifiers ($\forall$ and $\exists$) bind variables. Only free variables can be in the attributes part of a query. For example, $\{e.\text{lastName}, e.\text{firstName} \;|\; \operatorname{EMPLOYEE}(e) \text{ AND } (\exists p)(\exists w)(\operatorname{PROJECT}(p) \text{ AND } \operatorname{WORKS-ON}(w) \text{ AND } p.\text{departmentId} = 5 \text{ AND } w.\text{employeeId} = e.\text{id} \text{ AND } p.\text{id} = w.\text{projectId}\}$ selects employees who work on some project overseen by department 5. The SQL equivalent would be

SELECT e.lastName, e.firstName FROM employee e, project p, works_on w WHERE p.departmentId = 5 AND w.employeeId = e.id AND p.id = w.projectId;

A relational calculus expression is safe if it is guaranteed to return a finite number of tuples as a result. Queries which return only attribute values for a range relation are safe. An unsafe expression could be $\{t \;|\; \operatorname{NOT}(\operatorname{EMPLOYEE(t)})\}$. Safe expressions are equivalent to relational algebra.

Domain relational calculus

Tuple relational calculus variables range over tuples. Domain relational calculus variables range over attribute domains.

Atomic formulas

  • $R(x_1, ..., x_k)$ where $R$ is a relation of arity $k$
  • $x_1 \;\mathbf{op}\;x_j$ with the same definition of $\mathbf{op}$ as before
  • $(x_i \;\mathbf{op}\; c)$ or $(c \;\mathbf{op}\; x_i)$

Compound formulas are the same as for tuples.

The query to get the date of birth and address for an employee whose name is John Smith may look like $\{u, v \;|\; (\exists q)(\exists r)(\exists s)(\exists t)(\exists w)(\exists x)(\exists y)(\exists z)(\operatorname{EMPLOYEE}(qrstuvwxyz)) \text{ AND } q = \text{\lq John' AND } r = \text{\lq Smith'})\}$ where $qrstuvwxyz$ are all of the attributes in $\text{EMPLOYEE}$. For brevity, irrelevant quantifiers can be omitted so for the example above, only $(\exists q)(\exists r)(\exists s)$ need to be included. As another example, the full name and address of all employees in the research department could be queried using $\{q,s,v\;|\; (\exists z)(\exists l)(\exists m)(\operatorname{EMPLOYEE}(qrstuvwxyz) \text{ AND } \operatorname{DEPARTMENT}(lmno) \text{ AND } l = \text{\lq research' AND } m=z)\}$. Note that the other quantifiers should still be there, they are just omitted for brevity.

Codd's theorem states that relational algebra and relational calculus are equivalent in their expressive power. Languages equivalent in expressive power to relational algebra are known as relationally complete. SQL and relational calculus are relationally complete.

Normalisation

Normalisation is used to remove redundancies and make schemas easier to understand and use. Redundancy can lead to insert, update and delete anomalies which can make the database inconsistent.

Functional dependencies

Functional dependencies are constraints between sets of attributes derived from their meaning and relationships. They are used to identify good relational designs and to decide what to use as keys.

Given two sets of attributes, $X$ and $Y$, $X$ functionally determines $Y$ if the values of the $Y$ component of a tuple depend on the values of the $X$ component of the tuple. This is denoted $X \rightarrow Y$. The value of $X$ determines a unique value for $Y$.

Functional dependencies are derived from real world attribute constraints. For example, user ID functionally determines the username and password. If $X$ is a key, then $X \rightarrow Y$ for any non-key subset of attributes $Y$.

A superkey of a relational schema is a set of attributes which uniquely identifies each tuple in a relation. If $K$ is a superkey of $R$ then $K \rightarrow R$. A candidate key is a superkey with the fewest possible attributes. The primary key is the chosen candidate key. A prime or key attribute is one that is part of the primary key. It must be a member of some candidate key.

Decompositions

Effective schema decompositions are lossless or non-additive and preserve dependencies. Given a relation schema $R$ and a set $F$ of functional dependencies for $R$, $\{R_1, ..., R_k\}$ is a lossless-join decomposition of schema $R$ if for every legal relation instance $r$ of $R$, $\pi_{R_1}(r) \bowtie ... \bowtie \pi_{R_k}(r) = r$. Legal means $r$ satisfies $F$.

Given a relation schema $R$, the closure $F^+$ of a set of functional dependencies $F$ represents the set of all implied functional dependencies from $F$. Implied functional dependencies are derived using Armstrong's inference rules

  1. Reflexive - If $Y$ is a subset of $X$ then $X \rightarrow Y$
  2. Augmentation - If $X \rightarrow Y$ then $X \cup Z \rightarrow Y \cup Z$
  3. Transitive - If $X \rightarrow Y$ and $Y \rightarrow Z$ then $X \rightarrow Z$

These three rules for a sound and complete set of inference rules. Some additional useful rules which can be derived from the above are

  • Decomposition - If $X \rightarrow Y \cup Z$ then $X \rightarrow Y$ and $Y \rightarrow Z$
  • Union - If $X \rightarrow Y$ and $X \rightarrow Z$ then $X \rightarrow Y \cup Z$
  • Pseudotransitivity - If $X \rightarrow Y$ and $WY \rightarrow Z$ then $WX \rightarrow Z$

Given a relational schema $R$ and a set of functional dependencies $F$, $R_1$, $R_2$ is a lossless decomposition of $R$ if and only if $R_1 \cap R_2 \rightarrow (R_1 \setminus R_2)$ or $R_1 \cap R_2 \rightarrow (R_2 \setminus R_1)$ exist in $F^+$.

Normalisation decomposes relations by reducing redundancy while preserving dependencies using lossless joins.

Normal form

  • First normal form (1NF) - All attributes are atomic and there is a key
  • Second normal form (2NF) - Non-key attributes must be dependent the key
  • Third normal form (3NF) - Non-key attributes must only depend on the key
  • Boyce-Codd normal form (BCNF) - Every determinant of a non-trivial functional dependency is a superkey
  • Fourth, fifth and sixth normal form

3NF or BCNF are usually sufficient. All forms up to BCNF depend on functional dependencies.

First Normal Form

A relation is in 1NF if

  • All attributes are atomic, meaning they can't be broken down further. For example, an address is not atomic and could be broken into address line 1, address line 2, city and post code.
  • There is a key defined

1NF does not prevent insert/update/delete anomalies.

Second Normal Form

A relation is in 2NF if

  • It is in 1NF
  • Every non-key attribute is functionally dependent on any key
  • No proper subset of the key determines a non-key attribute

2NF prevents some insert/update/delete anomalies.

Third Normal Form

A relation is in 3NF if

  • It is in 2NF
  • There is no transitive functional dependency from a key to a non-key attribute (every non-key attribute depends only on the key)

It is always possible to find a decomposition that is in 3NF, dependency preserving and a lossless-join decomposition. 3NF also removes insert/update/delete anomalies.

Boyce-Codd Normal Form

A relation is in BCNF if whenever a functional dependency $X \rightarrow A$ holds then $X$ is a superkey. This means key attributes can't appear in the right hand side of a functional dependency. Most 3NF relations are also in BCNF.

Ideally normalisation will result in BCNF with lossless-join decomposition and dependency preservation. If this is not possible, the relation can either be in BCNF and lose dependency preservation or be in 3NF and preserve dependencies.

Security

Database security is needed to ensure integrity, availability and confidentiality. This is done through access control, inference control, flow control and encryption.

Discretionary Access Control

DAC involves granting and revoking privileges to different users. There are two types of privilege - account level and relation level. Account level privileges are those that apply to a user regardless of the table, such as CREATE SCHEMA and CREATE TABLE. Relation level privileges control access to each table or view.

GRANT can be used to grant privileges in SQL. For example, GRANT SELECT ON user TO username would grant select privileges on the user table to the user called username. UPDATE and INSERT privileges can specify attributes. For example, GRANT UPDATE ON users(address) TO username would allow updating the address attribute only.

SELECT privileges on views allow hiding of some attributes that would be available if the user had SELECT privileges on a table.

Privileges can be revoked using REVOKE.

GRANT ... WITH GRANT OPTION allows the recipient to also grant that privilege, but if their privilege is revoked so will that of those they grant it to.

Mandatory Access Control

MAC uses security classes to provide access control. For example, top secret > secret > classified > unclassified.

DAC is more flexible and commonly used but isn't as rigid as MAC, which is ideal for military or government use where security classes are appropriate.

Role-based Access Control

RBAC is an alternative to MAC and DAC which involves creating roles with privileges. Roles can be created or destroyed with CREATE ROLE and DESTROY ROLE respectively. GRANT ... ON role is used to give privileges to roles. GRANT ROLE ... TO ... is used to grant a user a role.

Inference control

Statistical databases are used to produce statistics about populations and contain confidential information about individuals which must not be accessed directly. This can be achieved by only allowing aggregate functions, but a series of queries may still reveal the values of individual tuples. This is called an inference attack.

Inference attacks can be prevented by

  • establishing a threshold on population size
  • not allowing repeated queries on the same population
  • partitioning records into larger groups and not allowing queries on subgroups
  • introducing noise to the data

Flow control

A flow occurs between $X$ and $Y$ when values are read from $X$ and written to $Y$. Flow policies specify the channels along which information is allowed to move, for example, to prevent classified information moving to an unclassified relation/attribute.

SQL injection

SQL injection occurs when parameter values given by a user are concatenated directly into a query. These parameters can be intentionally crafted to allow unauthorised access or modification of data. SQL injection is avoided by using parametrisation, which doesn't allow parameters to be interpreted as SQL.