PostgreSQL Tutorial
A blog post that outlines PostgreSQL basics
- Creating a Database
- Connecting to Database
- Creating a Table
- Inserting Rows into a Table
- Basic SELECT Syntax
- Querying a Subset of Columns
- Finding Unique Values
- Sorting Data
- Filtering Rows with WHERE
- Character Data Types
- Number Data Types
- Date and Time Data Types
- Transforming Values from One Data Type to Another
- Importing and Exporting Data
- Math with SQL
- Statistics with SQL
- Joining Tables
- JOIN Types
- Find Rows with Missing Values with NULL
- Primary and Foreign Keys
- Constraints
- Counting Rows and Values using count()
- Finding Maximum and Minimum Values
- Aggregating Data using GROUP BY
- Filtering an Aggregate using HAVING
- Checking the Length of a String using length()
- String Concatenation
- Modifying Tables, Columns and Data
- Deleting Data
- Save or Revert Changes using Transaction Blocks
- Improving Performance when Updating Large Tables
- Finding Ranks
- Calculating Rates
CREATE DATABASE analysis;
This statement creates a database on your server named analysis.
%load_ext sql
import os
host = "localhost"
database = "analysis"
user = os.getenv('SQL_USER')
password = os.getenv('SQL_PASSWORD')
connection_string = f"postgresql://{user}:{password}@{host}/{database}"
%sql $connection_string
%%sql
CREATE TABLE employees (
id serial,
first_name varchar(25),
last_name varchar(50),
department varchar(50),
hire_date date,
salary numeric
);
Each column name represents one discrete data element defined by a data type. The id column is of data type serial, a special integer type that auto-increments every time you add a row to the table.
%%sql
INSERT INTO employees (first_name, last_name, department, hire_date, salary)
VALUES ('Paul', 'Johnson', 'Sales', '2011-10-04', 85000),
('Nitin', 'Agarwal', 'Engineering', '2016-01-19', 65000),
('Tom', 'Rock', 'Engineering', '2015-09-15', 56000),
('Arun', 'Patil', 'Design', '2020-06-16', 44000),
('Nathan', 'Drake', 'Development', '2019-10-04', 68000),
('Tony', 'Brown', 'Development', '2021-10-04', 35000);
After the INSERT INTO keywords is the name of the table, and in parentheses are the columns to be filled. In the next row is the VALUES keyword and the data to insert into each column in each row. You need to enclose the data for each row in a set of parentheses, and inside each set of parentheses, use a comma to separate each column value. The order of the values must also match the other of the columns specified after the table name. Each row of data ends with a comma, and the last row ends the entire statement with a semicolon. Text and dates require quotes, whereas numbers, including integers and decimals, don't require quotes. The date format used is the international standard for date formats and using it will help you avoid confusion.
%%sql
SELECT * FROM employees;
SELECT * fetches every row and column in a table. The * is a wildcard: a stand-in for a value and it represents everything that a value could possibly be.
%%sql
SELECT last_name, first_name, salary FROM employees;
You can fetch a limited number of columns by naming columns, separated by commas, right after the SELECT keyword. You're able to retrieve columns in any order you like.
%%sql
SELECT DISTINCT department
FROM employees;
The DISTINCT keyword also works on more than one column at a time. If we add a column, the query returns each unique pair of values.
%%sql
SELECT DISTINCT department, salary
FROM employees;
This technique gives us the ability to ask, "For each x in the table, what are all the y values? For e.g., for each department, what are all the salaries?
%%sql
SELECT first_name, last_name, salary
FROM employees
ORDER BY salary DESC;
ORDER BY sorts values in the ascending order (ASC) by default, but you can use the DESC keyword to sort in the descending order.
We're not limited to sorting on just one column.
%%sql
SELECT first_name, last_name, department, hire_date
FROM employees
ORDER BY department ASC, hire_date DESC;
This statement shows us who are the newest employees in each department.
%%sql
SELECT last_name, department, hire_date
FROM employees
WHERE department = 'Sales';
The statement shows just the employees in the Sales department.
Comparison and Matching Operators
= Equal to
<> or != Not equal to
> Greater than
< Less than
>= Greater than or equal to
<= Less than or equal to
BETWEEN Within a range
IN Match one of a set of values
LIKE Match a pattern (case sensitive)
ILIKE Match a pattern (case insensitive)
NOT Negates a condition
Using LIKE and ILIKE with WHERE
Both operators let you search for patterns in strings by using two special characters:
- Percent sign (
%): a wildcard matching one or more characters - Underscore(
_): a wildcard matching just one character.
For e.g., if you're trying to find the word sales, the following LIKE patterns will match it.
LIKE 's%'
LIKE '%al%'
LIKE '_ales'
LIKE 'sa_es'
The only difference between the LIKE and ILIKE operators is that LIKE is case-sensitive and ILIKE is not.
%%sql
SELECT last_name
FROM employees
WHERE last_name ILIKE 'john%'
%%sql
SELECT *
FROM employees
WHERE department = 'Engineering'
AND (salary < 30000 OR salary > 60000)
Because we connect the two conditions using AND, both must be true for a row to meet the criteria in the WHERE clause and be returned in the query results. When we connect conditions using OR, only one of the conditions must be true for a row to meet the criteria of the WHERE clause. When we place statements inside parentheses, those are evaluated as a group before being combined with other criteria.
Character Data Types
-
char(n): A fixed-length column where the character length is specified by n. This is not used very often nowadays. -
varchar(n): A variable length column where the maximum length is specified by n. If you insert fewer characters than the maximum, PostgreSQL will not store the extra spaces. The longer name for this data type ischaracter varying(n). -
text: a variable length column of unlimited length.
%%sql
CREATE TABLE char_data_types (
varchar_column varchar(10),
char_column char(10),
text_column text
);
INSERT INTO char_data_types
VALUES
('abc', 'abc', 'abc'),
('jklmnop', 'jklmnop', 'jklmnop');
COPY char_data_types TO 'C:\YourDirectory\typetest.txt'
WITH (FORMAT CSV, HEADER, DELIMITER '|');
We don't always need to specify the column names with the INSERT INTO statement. If the VALUES statements match the number of columns in the table, the database will assume you're inserting values in the order the column definitions were specified in the table.
Here, the COPY keyword exports the data to a text file named typetest.txt in a directory you specify. The directory must already exist; PostgreSQL won't create it for you.
COPY table_name FROM is the import function and COPY table_name TO is the export function.
Number Data Types
Integers
There are three integer types
-
smallint: Storage size of 2 bytes with a range from −32768 to +32767. -
integer: Storage size of 4 bytes with a range from −2147483648 to +2147483647. -
bigint: Storage size of 8 bytes with a range from −9223372036854775808 to +9223372036854775807.
When the data values will remain constrained i.e. days of the month or year, smallint makes sense.
Auto-incrementing Integers
When you add a column with serial type, PostgreSQL will auto-increment the value in the column each time you insert a row, starting with 1, up to the maximum of each integer type.
-
smallserial: Storage size of 2 bytes with a range from 1 to 32767 -
serial: Storage size of 4 bytes with a range from 1 to 2147483647. -
bigserial: Storage size of 8 bytes with a range from 1 to 9223372036854775807.
Note that when it comes to serial data types, if a row is deleted, the value in that row is never replaced. If a row is insert is aborted, the sequence for the column will still be incremented.
Decimal Numbers
- Fixed-Point Numbers: Also called the arbitrary precision type, is
-
numeric(precision,scale): You give the argumentprecisionas the maximum number of digits to the left and right of the decimal point, and the argumentscaleas the number of digits allowable on the right of the decimal point. Alternatively, you can specify this type usingdecimal(precision,scale). If you omit specifying a scale value, the scale will be set to zero, which creates an integer. If you omit specifying the precision and the scale, the database will store values of any precision and scale up to the maximum allowed.
-
- Floating-Point Types: Also called variable-precision types, these are of two types
-
real: Allows precision up to 6 decimal digits. -
double: Allows precision up to 15 decimal digits.
-
%%sql
CREATE TABLE number_data_types (
numeric_column numeric(20,5),
real_column real,
double_column double precision
);
INSERT INTO number_data_types
VALUES
(.7, .7, .7),
(2.13579, 2.13579, 2.13579),
(2.1357987654, 2.1357987654, 2.1357987654);
SELECT * FROM number_data_types;
Notes:
- Unless your data uses decimals, stick with integer types.
- Floating-point types are referred to as "in-exact" and can lead to unintended mathematical errors. The storage required by the
numericdata type is variable, and can consume considerably more space than the floating point types. - If you're working with decimal data and need calculations to be exact (e.g. money, space travel, etc.), choose
numeric/decimal. - When choosing a number type, err on the side of bigger.
Date and Time Data Types
-
timestamp: Records date and time. You can add the keywordswith time zoneto ensure that the time recorded for an event includes the time zone where it occurred. You can either usetimestamp with time zoneor justtimestamptz. -
date: Records just the date. -
time: Records just the time. -
interval: Holds a value representing a unit of time expressed in the format quantity unit. It doesn't record the start or end of a time period, only its length. You'll typically use this type for calculations or filtering on other date and time columns.
The International Organization for Standardization (ISO) format for dates and times is YYYY-MM-DD HH:MM:SS.
%%sql
CREATE TABLE date_time_types (
timestamp_column timestamp with time zone,
interval_column interval
);
INSERT INTO date_time_types
VALUES
('2018-12-31 01:00 EST','2 days'),
('2018-12-31 01:00 -8','1 month'),
('2018-12-31 01:00 Australia/Melbourne','1 century'),
(now(),'1 week');
SELECT * FROM date_time_types;
For the first row, we use the time zone abbreviation.
For the second row, we use the a value that represents the number of hours offset from the Coordinated Universal Time (UTC).
For the third row, we use the name of an area and location using values found in a standard time zone database.
For the fourth row, we use the now() function, which captures the current transaction time from your hardware.
Transforming Values from One Data Type to Another
The CAST() function lets you transform a value from its stored data type to another type, although it only succeeds when the target data type can accommodate the original value. For e.g., casting an integer as text is possible, but the reverse is not.
A shortcut notation for this function is ::. Insert the double colon in between the name of the column and the data type you want to convert it to.
%%sql
SELECT timestamp_column, CAST(timestamp_column AS varchar(10))
FROM date_time_types;
SELECT timestamp_column::varchar(10)
FROM date_time_types;
Columns that contain Delimiters
Sometimes the delimiter used might also be a piece of the data, for e.g. an address in a CSV file can use a comma as an input value. In such cases, delimited files wrap columns that contain a delimiter character with an arbitrary character called a text classifier that tells SQL to ignore the delimiter character help within. Most of the time in CSV files, the text qualifier used is the double quote.
John,Doe,"123 Main St., Apartment 200",Hyde Park,NY,845-555-1212
The database will recognize that the double quotes signify one column regardless of whether it finds a delimiter within the quotes.
Using COPY to Import Data
COPY table_name
FROM 'C:\YourDirectory\your_file.csv'WITH (FORMAT CSV, HEADER);
The block of code starts with the COPY keyword followed by the name of the target table, which must already exist in your database. The FROM keyword identifies the full path to the source file, including its name. The WITH keyword lets you specify options, surrounded by parentheses, that you can tailor to your input or output file. Here we specify that the external file should be comma-delimited, and that we should exclude the file's header row in the import.
Use the FORMAT format_name option to specify the type of file you're reading or writing. Format names are CSV, TEXT, or BINARY. On import, use HEADER to specify that the source file has a header row. On export, using HEADER tells the database to include the column names as a header row in the output file. The DELIMITER 'character' option lets you specify which character your import or export file uses as a delimiter. If you use FORMAT CSV, the assumed delimiter is a comma.
Using COPY to Export Data
The main difference between exporting and importing data with COPY is that rather than using FROM to identify the source data, you use TO for the path and name of the output file.
-
Exporting All Data:
COPY table_name
TO 'C:\YourDirectory\your_file.csv'WITH (FORMAT CSV, HEADER, DELIMITER '|'); -
Exporting Particular Columns:
COPY table_name (column_1, columns_2, column_3)
TO 'C:\YourDirectory\your_file.csv'WITH (FORMAT CSV, HEADER, DELIMITER '|'); -
Exporting Query Results:
COPY (
SELECT column_1, column_2
FROM table_name
WHERE column_1 ILIKE '%post%'
TO 'C:\YourDirectory\your_file.csv'WITH (FORMAT CSV, HEADER, DELIMITER '|');
Math with SQL
Basic Math Operators
+ Addition
- Subtraction
* Multiplication
/ Division (returns only the quotient, no remainder)
% Modulo (returns just the remainder)
+ Exponentiation
|/ or sqrt(n) Square root
||/ Cube root
factorial(n) Factorial (how many ways can a number of items be ordered?)
In calculations with an operator between two numbers - addition, subtraction, multiplication, and division - the data type returned follows this pattern:
- Two integers return an integer.
- A numeric on either side of the operator returns a numeric.
- Anything with a floating-point number returns a floating-point number of type double precision.
However, the exponentiation, root, and factorial functions are different. Each takes one number either before or after the operator and returns numeric and floating-point types, even when the input is an integer. Sometimes the result's data type will suit your needs; other times, you may need to use CAST to change the data type
SELECT 2 + 2;
SELECT 4 * 8;
SELECT 12 / 5;
SELECT 12 % 5;
SELECT CAST (12 AS numeric(3,1)) / 5;
SELECT |/ 25;
SELECT 5!;
SELECT 5 ^ (8 - 4);
%%sql
CREATE TABLE employees (
id serial,
first_name varchar(25),
last_name varchar(50),
department varchar(50),
hire_date date,
starting_salary numeric,
present_salary numeric
);
INSERT INTO employees (first_name, last_name, department, hire_date, starting_salary, present_salary)
VALUES ('Paul', 'Johnson', 'Sales', '2011-10-04', 65000, 85000),
('Nitin', 'Agarwal', 'Engineering', '2016-01-19', 58000, 66000),
('Tom', 'Rock', 'Engineering', '2015-09-15', 48000, 56000),
('Arun', 'Patil', 'Design', '2020-06-16', 42000, 44000),
('Nathan', 'Drake', 'Development', '2019-10-04', 65000, 66000),
('Tony', 'Brown', 'Development', '2021-10-04', 35000, 35000);
%%sql
SELECT id,
starting_salary,
present_salary,
present_salary - starting_salary AS "Difference"
FROM employees;
The above table will give us the difference in the present and starting salary of the employees. The AS keyword is used to give a column a more readable alias.
%%sql
SELECT first_name, starting_salary, present_salary,
round((present_salary - starting_salary) / present_salary * 100, 1) AS "pct_change"
FROM employees;
The round() function here removes all but one decimal place. The function takes two arguments: the column or expression to be rounded, and the number of decimal places to display.
%%sql
SELECT sum(starting_salary) AS "Starting Sum",
round(avg(present_salary), 0) as "Present Average"
FROM employees;
Finding the Median
The median is the middle value in an ordered set of values. Averages are less helpful when the values are bunched, or skewed, toward on end of the distribution, or if the group includes outliers. On such occasions, the median is a better indicator of the data set. A good test is to calculate the average and the median for a group of values. If they're close, the group is probably normally distributed with a bell curve, and the average is useful. If they're apart, the values are not normally distributed and the median is the better representation.
Percentiles
Percentiles indicate the point in an ordered set of data below which a certain percentage of the data is found. The median is equivalent to the 50th percentile. We can use the percentile function to find out the median as well as other quantiles. The percentile_cont(n) function calculates percentiles as continuous values, which means that the result does not have to the one of the numbers in the data set but can be a decimal value in between of the two numbers. The percentile_disc(n) function returns only discrete values, which will return a rounded number from the set.
%%sql
CREATE TABLE percentile_test (
numbers integer
);
INSERT INTO percentile_test (numbers)
VALUES (1), (2), (3), (4), (5), (6);
SELECT
percentile_cont(.5) WITHIN GROUP (ORDER BY numbers),
percentile_disc(.5) WITHIN GROUP (ORDER BY numbers)
FROM percentile_test;
%%sql
SELECT percentile_cont(.5)
WITHIN GROUP (ORDER BY present_salary) as "Median Salary"
FROM employees;
You can also find out a bunch of other percentiles using an array as follows:
%%sql
SELECT percentile_cont(array[.25,.5,.75])
WITHIN GROUP (ORDER BY numbers)
FROM percentile_test;
You can use the unnest() function which makes the array easier to read by turning it into rows.
%%sql
SELECT unnest(
percentile_cont(array[.25,.5,.75])
WITHIN GROUP (ORDER BY numbers)) AS "quartiles"
FROM percentile_test;
%%sql
SELECT mode()
WITHIN GROUP (ORDER BY present_salary)
FROM employees;
Joining Tables
The JOIN statement links one table to another in the database during a query, using matching values in columns we specify in both tables.
SELECT *
FROM table_a JOIN table_b
ON table_a.key_column = table_b.foreign_key_column
This is similar to earlier SELECT statements, but instead of naming one table in the FROM clause, we name a table, give the JOIN keyword, and then name a second table. The ON keyword follows, where we specify the columns we want to use to match values.
You could also use the ON clause for an expression that evaluates to true or false. For e.g., you could match where values values from one column are greater than or equal to values in another column:> ON table_a.key_column >= table_b.foreign_key_column
%%sql
CREATE TABLE departments (
dept_id bigserial,
dept varchar(100),
city varchar(100),
CONSTRAINT dept_key PRIMARY KEY (dept_id),
CONSTRAINT dept_city_unique UNIQUE (dept, city)
);
CREATE TABLE employees (
emp_id bigserial,
first_name varchar(100),
last_name varchar(100),
salary integer,
dept_id integer REFERENCES departments (dept_id),
CONSTRAINT emp_key PRIMARY KEY (emp_id),
CONSTRAINT emp_dept_unique UNIQUE (emp_id, dept_id)
);
INSERT INTO departments (dept, city)
VALUES
('Tax', 'Atlanta'),
('IT', 'Boston');
INSERT INTO employees (first_name, last_name, salary, dept_id)
VALUES
('Nancy', 'Jones', 62500, 1),
('Lee', 'Smith', 59300, 1),
('Soo', 'Nguyen', 83000, 2),
('Janet', 'King', 95000, 2);
In the departments table, the dept_id column is the table's primary key. A primary key is a column or collection of columns whose values uniquely identify each row in a table. A valid primary key column enforces certain constraints:
- The column or collection of columns must have a unique value for each row. The values only need to be unique within a table.
- The column or collection of columns can't have missing values.
You define the primary key for departments and employees using the CONSTRAINT keyword.
In the employees table, the emp_id column uniquely identifies each row, thereby making it the primary key. The dept_id column in the employees table refers to values in the departments table's primary key. This is a foreign key, which you add as a constraint when creating a table. A foreign key constraint requires a value entered in a column to already exist in the primary key of the table it references. So, values in dept_id in the employees table must exists in dept_id in the departments table; otherwise you can't add them. Unlike a primary key, a foreign key can be empty, and it can contain duplicate values.
Both tables also include a UNIQUE constraint, which guarantees that values in a column, or a combination of values in more than one column, are unique. In departments, it requires that each row have a unique pair of values for dept and city. In employees, each row must have a unique pair of emp_id and dept_id. You add these constraints to avoid duplicate data.
Querying multiple tables using JOIN
When you join tables in a query, the database connects rows in both tables where the columns you specified for the join have matching values. The query results then include columns from both tables if you requested them as part of the query. You also can use columns from the joined tables to filter results using a WHERE clause.
Let's look at the simple JOIN ... ON syntax first.
%%sql
SELECT *
FROM employees JOIN departments
ON employees.dept_id = departments.dept_id;
In the above example, you include an asterisk to choose all columns from both tables. Next, the JOIN keyword goes between the two tables you want data from. Finally, you specify the columns to join the tables using the ON keyword. When you run the query, the results include all values from both tables where values in the dept_id column match.
%%sql
CREATE TABLE schools_left (
id integer CONSTRAINT left_id_key PRIMARY KEY,
left_school varchar(30)
);
CREATE TABLE schools_right (
id integer CONSTRAINT right_id_key PRIMARY KEY,
right_school varchar(30)
);
INSERT INTO schools_left (id, left_school)
VALUES
(1, 'Oak Street School'),
(2, 'Roosevelt High School'),
(5, 'Washington Middle School'),
(6, 'Jefferson High School');
INSERT INTO schools_right (id, right_school)
VALUES
(1, 'Oak Street School'),
(2, 'Roosevelt High School'),
(3, 'Morrison Elementary'),
(4, 'Chase Magnet Academy'),
(6, 'Jefferson High School');
JOIN or INNER JOIN: Returns rows from both tables where matching values are found in the joined columns of both tables.
%%sql
SELECT *
FROM schools_left JOIN schools_right
ON schools_left.id = schools_right.id;
LEFT JOIN: Returns every row from the left table plus rows that match values in the joined column from the right table.
The LEFT JOIN and RIGHT JOIN keywords each return all rows from one table and display blank rows from the other table if no matching values are found in the joined columns.
%%sql
SELECT *
FROM schools_left LEFT JOIN schools_right
ON schools_left.id = schools_right.id;
RIGHT JOIN: Returns every row from the right table plus rows that match the key values in the key column from the left table.
%%sql
SELECT *
FROM schools_left RIGHT JOIN schools_right
ON schools_left.id = schools_right.id;
FULL OUTER JOIN: Returns every row from both tables and matches rows; then joins the rows where values in the joined columns match. If there's no match for a value in either the left or right table, the query result contains an empty row for the other table.
%%sql
SELECT *
FROM schools_left FULL OUTER JOIN schools_right
ON schools_left.id = schools_right.id;
CROSS JOIN: The query returns each row in the left table with each row in the right table to present all possible combination of rows from both tables.
Definitely avoid this join on large tables.
%%sql
SELECT *
FROM schools_left CROSS JOIN schools_right;
%%sql
SELECT schools_left.id AS left_id, schools_left.left_school,
schools_right.right_school
FROM schools_left LEFT JOIN schools_right
ON schools_left.id = schools_right.id;
%%sql
SELECT lt.id, lt.left_school, rt.right_school
FROM schools_left AS lt LEFT JOIN schools_right AS rt
ON lt.id = rt.id;
%%sql
CREATE TABLE schools_enrollment (
id integer,
enrollment integer
);
CREATE TABLE schools_grades (
id integer,
grades varchar(10)
);
INSERT INTO schools_enrollment (id, enrollment)
VALUES
(1, 360),
(2, 1001),
(5, 450),
(6, 927);
INSERT INTO schools_grades (id, grades)
VALUES
(1, 'K-3'),
(2, '9-12'),
(5, '6-8'),
(6, '9-12');
%%sql
SELECT lt.id, lt.left_school, en.enrollment, gr.grades
FROM schools_left AS lt LEFT JOIN schools_enrollment AS en
ON lt.id = en.id
LEFT JOIN schools_grades AS gr
ON lt.id = gr.id;
In the above SELECT query, we join schools_left to schools_enrollment using the tables' id fields. Next, the query joins schools_left to school_grades again on the id fields.
Find Rows with Missing Values with NULL
In SQL, NULL is a special value that represents a condition in which there's no data present or where the data is unknown because it wasn't included. You can use NULL across data types.
You can find empty rows by adding a WHERE clause to filter for NULL by using the phrase IS NULL. If you want to look for columns with data, use IS NOT NULL.
%%sql
SELECT *
FROM schools_left LEFT JOIN schools_right
ON schools_left.id = schools_right.id
WHERE schools_right.id IS NULL;
%%sql
CREATE TABLE natural_key(
license_id varchar(10) CONSTRAINT license_key PRIMARY KEY,
first_name varchar(50),
last_name varchar(50)
);
Note that in the column constraint syntax, you can omit the CONSTRAINT keyword and the name for the key, and simply use PRIMARY KEY.
Now let's look at an example of a table constraint, where we declare the CONSTRAINT after listing the final column. with the column we want to use as they key in parentheses.
%%sql
CREATE TABLE natural_key(
license_id varchar(10),
first_name varchar(50),
last_name varchar(50),
CONSTRAINT license_key PRIMARY KEY (license_id)
);
Note that you must use the table constraint syntax when you want to create a primary key using more than one column. In that case, you would list the columns in parentheses, separated by commas.
CREATE TABLE natural_key_composite(
license_id varchar(10),
first_name varchar(50),
last_name varchar(50),
CONSTRAINT license_key PRIMARY KEY (license_id, last_name)
);
Auto-incrementing Surrogate Key
We've already seen the three serial types: smallserial, serial, and bigserial. An easy way to create a surrogate primary key is with an auto-incrementing integer using one of the serial data types.
PostgreSQL also allows you store and compare UUID values but it does not include functions for generating the UUID values in its core. Instead, it relies on the third-party modules that provide specific algorithms to generate UUIDs.
CREATE TABLE surrogate_key_example (
order_number bigserial,
product_name varchar(50),
order_numberder_date date,
CONSTRAINT order_key PRIMARY KEY (order_number)
);
CREATE TABLE licenses (
license_id varchar(10),
first_name varchar(50),
last_name varchar(50),
CONSTRAINT licenses_key PRIMARY KEY (license_id)
);
INSERT INTO licenses (license_id, first_name, last_name)
VALUES ('T229901', 'Lynn', 'Malero');
CREATE TABLE registrations (
registration_id varchar(10),
registration_date date,
license_id varchar(10) REFERENCES licenses (license_id),
CONSTRAINT registration_key PRIMARY KEY (registration_id, license_id)
);
INSERT INTO registrations (registration_id, registration_date, license_id)
VALUES ('A203391', '3/17/2017', 'T229901');
Deleting Related Records with CASCADE
To maintain referential integrity, the foreign key constraint prevents us from deleting a row from the licenses table above before removing any related rows in the registrations table, because doing so would leave an orphaned record. To delete a row in licenses and have that action automatically delete any related rows in registrations, we can specify that behavior by adding ON DELETE CASCADE when defining the foreign key constraint.
CREATE TABLE registrations (
registration_id varchar(10),
registration_date date,
license_id varchar(10) REFERENCES licenses (license_id) ON DELETE CASCADE,
CONSTRAINT registration_key PRIMARY KEY (registration_id, license_id)
);
With the above query, deleting a row in licenses should also delete all related rows in registrations.
Constraints
The CHECK Constraint
A CHECK constraint evaluates whether data added to a column meets the expected criteria, which we specify with a logical test. We can implement it as a column or a table constraint.
For a column constraint, declare it in the CREATE TABLE statement after the column name and data type: CHECK (logical expression).
As a table constraint, use the syntax CONSTRAINT constraint_name CHECK (logical expression) after all columns are defined.
%%sql
CREATE TABLE check_constraint_example(
user_id bigserial,
user_role varchar(50),
salary integer,
CONSTRAINT user_id_key PRIMARY KEY (user_id),
CONSTRAINT check_role_in_list CHECK (user_role IN ('Admin', 'Staff')),
CONSTRAINT check_salary_not_zero CHECK (salary > 0)
);
If we use the table constraint syntax, we can also combine more than one test in single CHECK statement.
CONSTRAINT grad_check CHECK (credits >= 120 AND tuition = 'Paid')
CONSTRAINT sale_check CHECK (sale_price < retail_price)
%%sql
CREATE TABLE unique_constraint_example (
contact_id bigserial CONSTRAINT contact_id_key PRIMARY KEY,
first_name varchar(50),
last_name varchar(50),
email varchar(200),
CONSTRAINT email_unique UNIQUE(email)
);
%%sql
CREATE TABLE not_null_example (
student_id bigserial,
first_name varchar(50) NOT NULL,
last_name varchar(50) NOT NULL,
CONSTRAINT student_id_key PRIMARY KEY (student_id)
);
Modifying Constraints
You can remove a constraint or add one later to an existing table using ALTER TABLE, a command which makes changes to tables and columns. You can only add a constraint to an existing table if the data in the target column
obeys the limits of the constraint.
To remove a primary key, foreign key, or a UNIQUE constraint:
ALTER TABLEtable_nameDROP CONSTRAINTconstraint_name;
To remove a NOT NULL constraint:> ALTER TABLE table_name ALTER COLUMN column_name DROP NOT NULL;
The following statements can be executed one at a time.
%%sql
ALTER TABLE not_null_example DROP CONSTRAINT student_id_key;
ALTER TABLE not_null_example ADD CONSTRAINT student_id_key PRIMARY KEY (student_id);
ALTER TABLE not_null_example ALTER COLUMN first_name DROP NOT NULL;
ALTER TABLE not_null_example ALTER COLUMN first_name SET NOT NULL;
%%sql
CREATE TABLE employees (
id serial,
first_name varchar(25),
last_name varchar(50),
department varchar(50),
hire_date date,
starting_salary numeric,
present_salary numeric
);
INSERT INTO employees (first_name, last_name, department, hire_date, starting_salary, present_salary)
VALUES ('Paul', 'Johnson', 'Sales', '2011-10-04', 65000, 85000),
('Nitin', 'Agarwal', 'Engineering', '2016-01-19', 58000, 66000),
('Tom', 'Rock', 'Engineering', '2015-09-15', 48000, 56000),
('Arun', 'Patil', 'Design', '2020-06-16', 42000, 44000),
('Nathan', 'Drake', 'Development', '2019-10-04', 65000, 66000),
('Tony', 'Brown', 'Development', '2021-10-04', 35000, 35000);
%%sql
SELECT count(*)
FROM employees;
%%sql
SELECT count(present_salary)
FROM employees;
%%sql
SELECT count(DISTINCT present_salary)
FROM employees;
%%sql
SELECT max(present_salary), min(present_salary)
FROM employees;
%%sql
SELECT department
FROM employees
GROUP BY department;
The GROUP BY clause follows the FROM clause and includes the
You're not limited to grouping just one column.
%%sql
SELECT department, present_salary
FROM employees
GROUP BY department, present_salary
ORDER BY department, present_salary;
%%sql
SELECT department, count(*)
FROM employees
GROUP BY department
ORDER BY count(*) DESC;
Note that when we select individual columns along with an aggregate function, we must include the columns in a GROUP BY clause. You can't group values by aggregating and have ungrouped column values in the same query.
%%sql
SELECT department, starting_salary, count(*)
FROM employees
GROUP BY department, starting_salary
ORDER BY starting_salary ASC, count(*) DESC;
Filtering an Aggregate using HAVING
To filter results of aggregate functions, we need to use the HAVING clause. Aggregate functions, such as sum(), can't be used within a WHERE clause because they operate at the row level, and aggregate functions work across rows. The HAVING clause places conditions on groups created by aggregating.
%%sql
SELECT department,
sum(starting_salary) str_sal, sum(present_salary) end_sal,
round((cast(sum(present_salary) AS decimal(10,2)) -
sum(starting_salary)) / sum(present_salary)*100,2) AS pct_change
FROM employees
GROUP BY department, starting_salary
HAVING starting_salary > 50000
ORDER BY pct_change DESC;
%%sql
SELECT length(last_name)
FROM employees
ORDER BY length(last_name) DESC;
%%sql
UPDATE employees
SET department = department || ' dept';
Modifying Tables, Columns and Data
Modifying Tables with ALTER TABLE
We can use ALTER TABLE to modify the structure of tables.
The code for adding a column:
ALTER TABLE table_name ADD COLUMN column_name data_type;> The code to remove a column:ALTER TABLE table_name DROP COLUMN column_name;> The code to change the data type of a column:ALTER TABLE table_name ALTER COLUMN column_name SET DATA TYPE data_type;> The code to add aNOT NULLconstraint to a column:ALTER TABLE table_name ALTER COLUMN column_name SET NOT NULL;> The code to remove aNOT NULLconstraint:ALTER TABLE table_name ALTER COLUMN column_name DROP NOT NULL;
Modifying Values with UPDATE
The UPDATE statement modifies the data in a column in all rows or in a subset of rows that meet a condition. First we pass UPDATE the name of the table to update, and then pass the SET clause the column that contains the values to change. The new value to
place in the column can be a string, number, the name of another column,
or even a query or expression that generates a value.
UPDATE table_name SET column_name = value;
We can update values in multiple columns at a time by adding additional columns and source values, and separating each column and value statement with a comma.
UPDATE table_name SET column_a = value_1, column_b = value_2;
We can restrict the update to particular rows by adding a WHERE clause with some criteria that must be met before the update can happen.
UPDATE table_name SET column_a = value WHERE criteria;
We can also update one table with values from another table using a subquery, a query inside a query.
UPDATE table_a SET column = (SELECT column FROM table_b WHERE table_a.column = table_b.column) WHERE EXISTS (SELECT column FROM table_b WHERE table_a.column = table_b.column);
The value portion of the SET clause is a subquery, which is a SELECT statement inside parentheses that generates the value for the update. Similarly, the WHERE EXISTS clause uses a SELECT statement to generate values
that serve as the filter for the update. If we didn't use this clause, we might
inadvertently set some values to NULL without planning to.
PostgreSQL also offers a simpler syntax using a FROM clause for updating values across tables.
UPDATE table_a
SET column = table_b.column
FROM table_b
WHERE table_a.column = table_b.column
Deleting Data
SQL includes options to remove rows and columns from a table along with options to delete an entire table or database. It's easy to exclude unwanted data in queries using a WHERE clause, so decide whether you truly need to delete the data or can just filter it out. Cases where deleting may be the best solution include data with errors or data imported incorrectly.
Deleting Rows from a Table
Using a DELETE FROM statement, we can remove all rows from a table, or we can use a WHERE clause to delete only the portion that matches an expression we supply.
Delete all rows from a table:
DELETE FROM table_name;> Delete selected rows:DELETE FROM table_name WHERE expression;
Save or Revert Changes using Transaction Blocks
The only way to undo any changes after you run a DELETE or UPDATE query is to restore from a backup. However, you can check your changes
before finalizing them and cancel the change if it's not what you intended.
You do this by wrapping the SQL statement within a transaction block, which
is a group of statements you define using the following keywords at the
beginning and end of the query:
-
START TRANSACTIONsignals the start of the transaction block. In PostgreSQL, you can also use the non-ANSI SQL BEGIN keyword. -
COMMITsignals the end of the block and saves all changes. -
ROLLBACKsignals the end of the block and reverts all changes.
Usually, database programmers employ a transaction block in complex database systems to define the start and end of a sequence of operations that perform one unit of work in a database.
Improving Performance when Updating Large Tables
Adding a column to a table and filling it with values can quickly inflate the table's size because the database creates a new version of the existing row each time a value is updated, and it doesn't delete the old row version.
Instead of adding a column and filling it with values, we can save disk space by copying the entire table and adding a populated column during the operation. Then, we rename the tables so the copy replaces the original, and the original becomes a backup.
CREATE TABLE employees_backup AS
SELECT *, '2021-01-12'::date AS end_date
FROM employees;
ALTER TABLE employees
RENAME TO employees_temp;
ALTER TABLE employees_backup
RENAME TO employees;
ALTER TABLE employees_temp
RENAME TO employees_backup;
Finding Ranks
We'll focus on two rank functions, rank() and dense_rank(), which are both window functions that perform calculations across set of rows we specify using the OVER clause.
The difference between rank() and dense_rank() is the way they handle the next rank value after a tie: rank() includes a gap in the rank order, but dense_rank() does not.
%%sql
SELECT first_name, present_salary,
rank() OVER (ORDER BY present_salary DESC),
dense_rank() OVER (ORDER BY present_salary DESC)
FROM employees;
In my personal opinion, rank() should be used more often since it more accurately reflects the total number of companies ranked.
%%sql
SELECT first_name, department, present_salary,
rank() OVER (PARTITION BY department ORDER BY present_salary DESC)
FROM employees;
Calculating Rates
Analysts often calculate a rate per 1,000 people, or some multiple of that number, for apples-to-apples comparisons. For e.g., if we're trying to find the rate of crime.
Say x is the number of offenses and p is the total population, then the rate of offenses per 1,000 people is
(x / p) * 1000
select city, st, population, crime,
round((crime::numeric / population) * 1000, 1)
as pc_per_1000
from crime_data
where population >= 500000
order by (crime::numeric / population) desc;