Creating a Database

CREATE DATABASE analysis;

This statement creates a database on your server named analysis.

Connecting to Database

Setting up the conda environment

conda create -n sql python
conda activate sql
conda install ipython-sql sqlalchemy psycopg2 notebook pandas -c conda-forge

%load_ext sql

import os

host = "localhost"
database = "analysis"
user = os.getenv('SQL_USER')
password = os.getenv('SQL_PASSWORD')

connection_string = f"postgresql://{user}:{password}@{host}/{database}"

%sql $connection_string

'Connected: postgres@analysis'

Creating a Table

%%sql
CREATE TABLE employees (
    id serial,
    first_name varchar(25),
    last_name varchar(50),
    department varchar(50),
    hire_date date,
    salary numeric
);

 * postgresql://postgres:***@localhost/analysis
Done.

[]

Each column name represents one discrete data element defined by a data type. The id column is of data type serial, a special integer type that auto-increments every time you add a row to the table.

Inserting Rows into a Table

%%sql
INSERT INTO employees (first_name, last_name, department, hire_date, salary)
VALUES  ('Paul', 'Johnson', 'Sales', '2011-10-04', 85000),
        ('Nitin', 'Agarwal', 'Engineering', '2016-01-19', 65000),
        ('Tom', 'Rock', 'Engineering', '2015-09-15', 56000),
        ('Arun', 'Patil', 'Design', '2020-06-16', 44000),
        ('Nathan', 'Drake', 'Development', '2019-10-04', 68000),
        ('Tony', 'Brown', 'Development', '2021-10-04', 35000);

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

[]

After the INSERT INTO keywords is the name of the table, and in parentheses are the columns to be filled. In the next row is the VALUES keyword and the data to insert into each column in each row. You need to enclose the data for each row in a set of parentheses, and inside each set of parentheses, use a comma to separate each column value. The order of the values must also match the other of the columns specified after the table name. Each row of data ends with a comma, and the last row ends the entire statement with a semicolon. Text and dates require quotes, whereas numbers, including integers and decimals, don't require quotes. The date format used is the international standard for date formats and using it will help you avoid confusion.

Basic SELECT Syntax

%%sql
SELECT * FROM employees;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

SELECT * fetches every row and column in a table. The * is a wildcard: a stand-in for a value and it represents everything that a value could possibly be.

Querying a Subset of Columns

%%sql
SELECT last_name, first_name, salary FROM employees;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

You can fetch a limited number of columns by naming columns, separated by commas, right after the SELECT keyword. You're able to retrieve columns in any order you like.

Finding Unique Values

To understand the range of values in a column, we can use the DISTINCT keyword as part of a query that eliminates duplicates and shows only unique values.

%%sql
SELECT DISTINCT department
FROM employees;

 * postgresql://postgres:***@localhost/analysis
4 rows affected.

The DISTINCT keyword also works on more than one column at a time. If we add a column, the query returns each unique pair of values.

%%sql
SELECT DISTINCT department, salary
FROM employees;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

This technique gives us the ability to ask, "For each x in the table, what are all the y values? For e.g., for each department, what are all the salaries?

Sorting Data

We order the results of a query using a clause containing the keywords ORDER BY followed by the name of the column or columns to sort. Applying this clause doesn't change the original table, only the result of the query.

%%sql
SELECT first_name, last_name, salary
FROM employees
ORDER BY salary DESC;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

ORDER BY sorts values in the ascending order (ASC) by default, but you can use the DESC keyword to sort in the descending order.
We're not limited to sorting on just one column.

%%sql
SELECT first_name, last_name, department, hire_date
FROM employees
ORDER BY department ASC, hire_date DESC;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

This statement shows us who are the newest employees in each department.

Filtering Rows with WHERE

You can limit the rows a query returns to only those in which one or more columns meet certain criteria.

%%sql
SELECT last_name, department, hire_date
FROM employees
WHERE department = 'Sales';

 * postgresql://postgres:***@localhost/analysis
1 rows affected.

The statement shows just the employees in the Sales department.

Comparison and Matching Operators

= Equal to
<> or != Not equal to
> Greater than
< Less than
>= Greater than or equal to
<= Less than or equal to
BETWEEN Within a range
IN Match one of a set of values
LIKE Match a pattern (case sensitive)
ILIKE Match a pattern (case insensitive)
NOT Negates a condition

Using LIKE and ILIKE with WHERE

Both operators let you search for patterns in strings by using two special characters:

Percent sign (%): a wildcard matching one or more characters
Underscore(_): a wildcard matching just one character.

For e.g., if you're trying to find the word sales, the following LIKE patterns will match it.

LIKE 's%'
LIKE '%al%'
LIKE '_ales'
LIKE 'sa_es'

The only difference between the LIKE and ILIKE operators is that LIKE is case-sensitive and ILIKE is not.

%%sql
SELECT last_name
FROM employees
WHERE last_name ILIKE 'john%'

 * postgresql://postgres:***@localhost/analysis
1 rows affected.

Combining Operators with AND and OR

We can combine comparison operators using the keywords AND and OR.

%%sql
SELECT *
FROM employees
WHERE department = 'Engineering'
    AND (salary < 30000 OR salary > 60000)

 * postgresql://postgres:***@localhost/analysis
1 rows affected.

Because we connect the two conditions using AND, both must be true for a row to meet the criteria in the WHERE clause and be returned in the query results. When we connect conditions using OR, only one of the conditions must be true for a row to meet the criteria of the WHERE clause. When we place statements inside parentheses, those are evaluated as a group before being combined with other criteria.

Character Data Types

char(n): A fixed-length column where the character length is specified by n. This is not used very often nowadays.
varchar(n): A variable length column where the maximum length is specified by n. If you insert fewer characters than the maximum, PostgreSQL will not store the extra spaces. The longer name for this data type is character varying(n).
text: a variable length column of unlimited length.

%%sql
CREATE TABLE char_data_types (
    varchar_column varchar(10),
    char_column char(10),
    text_column text 
);

INSERT INTO char_data_types
VALUES
    ('abc', 'abc', 'abc'),
    ('jklmnop', 'jklmnop', 'jklmnop');

 * postgresql://postgres:***@localhost/analysis
Done.
2 rows affected.

[]

COPY char_data_types TO 'C:\YourDirectory\typetest.txt'
WITH (FORMAT CSV, HEADER, DELIMITER '|');

We don't always need to specify the column names with the INSERT INTO statement. If the VALUES statements match the number of columns in the table, the database will assume you're inserting values in the order the column definitions were specified in the table.
Here, the COPY keyword exports the data to a text file named typetest.txt in a directory you specify. The directory must already exist; PostgreSQL won't create it for you.
COPY table_name FROM is the import function and COPY table_name TO is the export function.

Number Data Types

Integers

There are three integer types

smallint: Storage size of 2 bytes with a range from −32768 to +32767.
integer: Storage size of 4 bytes with a range from −2147483648 to +2147483647.
bigint: Storage size of 8 bytes with a range from −9223372036854775808 to +9223372036854775807.

When the data values will remain constrained i.e. days of the month or year, smallint makes sense.

Auto-incrementing Integers

When you add a column with serial type, PostgreSQL will auto-increment the value in the column each time you insert a row, starting with 1, up to the maximum of each integer type.

smallserial: Storage size of 2 bytes with a range from 1 to 32767
serial: Storage size of 4 bytes with a range from 1 to 2147483647.
bigserial: Storage size of 8 bytes with a range from 1 to 9223372036854775807.

Note that when it comes to serial data types, if a row is deleted, the value in that row is never replaced. If a row is insert is aborted, the sequence for the column will still be incremented.

Decimal Numbers

Fixed-Point Numbers: Also called the arbitrary precision type, is
- numeric(precision,scale): You give the argument precision as the maximum number of digits to the left and right of the decimal point, and the argument scale as the number of digits allowable on the right of the decimal point. Alternatively, you can specify this type using decimal(precision,scale). If you omit specifying a scale value, the scale will be set to zero, which creates an integer. If you omit specifying the precision and the scale, the database will store values of any precision and scale up to the maximum allowed.
Floating-Point Types: Also called variable-precision types, these are of two types
- real: Allows precision up to 6 decimal digits.
- double: Allows precision up to 15 decimal digits.

%%sql
CREATE TABLE number_data_types (
    numeric_column numeric(20,5),
    real_column real, 
    double_column double precision
);

INSERT INTO number_data_types
VALUES
    (.7, .7, .7),
    (2.13579, 2.13579, 2.13579),
    (2.1357987654, 2.1357987654, 2.1357987654);

SELECT * FROM number_data_types;

 * postgresql://postgres:***@localhost/analysis
Done.
3 rows affected.
3 rows affected.

Notes:

Unless your data uses decimals, stick with integer types.
Floating-point types are referred to as "in-exact" and can lead to unintended mathematical errors. The storage required by the numeric data type is variable, and can consume considerably more space than the floating point types.
If you're working with decimal data and need calculations to be exact (e.g. money, space travel, etc.), choose numeric/decimal.
When choosing a number type, err on the side of bigger.

Date and Time Data Types

timestamp: Records date and time. You can add the keywords with time zone to ensure that the time recorded for an event includes the time zone where it occurred. You can either use timestamp with time zone or just timestamptz.
date: Records just the date.
time: Records just the time.
interval: Holds a value representing a unit of time expressed in the format quantity unit. It doesn't record the start or end of a time period, only its length. You'll typically use this type for calculations or filtering on other date and time columns.

The International Organization for Standardization (ISO) format for dates and times is YYYY-MM-DD HH:MM:SS.

%%sql
CREATE TABLE date_time_types (
    timestamp_column timestamp with time zone,
    interval_column interval
);

INSERT INTO date_time_types
VALUES
    ('2018-12-31 01:00 EST','2 days'),
    ('2018-12-31 01:00 -8','1 month'),
    ('2018-12-31 01:00 Australia/Melbourne','1 century'),
    (now(),'1 week');

SELECT * FROM date_time_types;

 * postgresql://postgres:***@localhost/analysis
Done.
4 rows affected.
4 rows affected.

For the first row, we use the time zone abbreviation.
For the second row, we use the a value that represents the number of hours offset from the Coordinated Universal Time (UTC).
For the third row, we use the name of an area and location using values found in a standard time zone database.
For the fourth row, we use the now() function, which captures the current transaction time from your hardware.

Transforming Values from One Data Type to Another

The CAST() function lets you transform a value from its stored data type to another type, although it only succeeds when the target data type can accommodate the original value. For e.g., casting an integer as text is possible, but the reverse is not.
A shortcut notation for this function is ::. Insert the double colon in between the name of the column and the data type you want to convert it to.

%%sql
SELECT timestamp_column, CAST(timestamp_column AS varchar(10))
FROM date_time_types;

SELECT timestamp_column::varchar(10)
FROM date_time_types;

 * postgresql://postgres:***@localhost/analysis
4 rows affected.
4 rows affected.

Importing and Exporting Data

Delimited Text Files

A delimited text file contains rows of data, and each row represents one row in a table. In each row, a character separates, or delimits, each data column.

Columns that contain Delimiters

Sometimes the delimiter used might also be a piece of the data, for e.g. an address in a CSV file can use a comma as an input value. In such cases, delimited files wrap columns that contain a delimiter character with an arbitrary character called a text classifier that tells SQL to ignore the delimiter character help within. Most of the time in CSV files, the text qualifier used is the double quote.

John,Doe,"123 Main St., Apartment 200",Hyde Park,NY,845-555-1212

The database will recognize that the double quotes signify one column regardless of whether it finds a delimiter within the quotes.

Handling Header Rows

A header row is a single row at the top of the file that lists the name of each data field. Because PostgreSQL does not use the header row, we don't want that row imported to a table, so we'll use a HEADER option in the COPY command to exclude it.

Using COPY to Import Data

COPY table_name
FROM 'C:\YourDirectory\your_file.csv' WITH (FORMAT CSV, HEADER);

The block of code starts with the COPY keyword followed by the name of the target table, which must already exist in your database. The FROM keyword identifies the full path to the source file, including its name. The WITH keyword lets you specify options, surrounded by parentheses, that you can tailor to your input or output file. Here we specify that the external file should be comma-delimited, and that we should exclude the file's header row in the import.
Use the FORMAT format_name option to specify the type of file you're reading or writing. Format names are CSV, TEXT, or BINARY. On import, use HEADER to specify that the source file has a header row. On export, using HEADER tells the database to include the column names as a header row in the output file. The DELIMITER 'character' option lets you specify which character your import or export file uses as a delimiter. If you use FORMAT CSV, the assumed delimiter is a comma.

Using COPY to Export Data

The main difference between exporting and importing data with COPY is that rather than using FROM to identify the source data, you use TO for the path and name of the output file.

Exporting All Data:

COPY table_name
TO 'C:\YourDirectory\your_file.csv' WITH (FORMAT CSV, HEADER, DELIMITER '|');
Exporting Particular Columns:

COPY table_name (column_1, columns_2, column_3)
TO 'C:\YourDirectory\your_file.csv' WITH (FORMAT CSV, HEADER, DELIMITER '|');
Exporting Query Results:

COPY (
SELECT column_1, column_2
FROM table_name
WHERE column_1 ILIKE '%post%'
TO 'C:\YourDirectory\your_file.csv' WITH (FORMAT CSV, HEADER, DELIMITER '|');

Math with SQL

Basic Math Operators

+ Addition
- Subtraction
* Multiplication
/ Division (returns only the quotient, no remainder)
% Modulo (returns just the remainder)
+ Exponentiation
|/ or sqrt(n) Square root
||/ Cube root
factorial(n) Factorial (how many ways can a number of items be ordered?)

In calculations with an operator between two numbers - addition, subtraction, multiplication, and division - the data type returned follows this pattern:

Two integers return an integer.
A numeric on either side of the operator returns a numeric.
Anything with a floating-point number returns a floating-point number of type double precision.

However, the exponentiation, root, and factorial functions are different. Each takes one number either before or after the operator and returns numeric and floating-point types, even when the input is an integer. Sometimes the result's data type will suit your needs; other times, you may need to use CAST to change the data type

SELECT 2 + 2;
SELECT 4 * 8;
SELECT 12 / 5;
SELECT 12 % 5;
SELECT CAST (12 AS numeric(3,1)) / 5;
SELECT |/ 25;
SELECT 5!;
SELECT 5 ^ (8 - 4);

Math across Columns

%%sql
CREATE TABLE employees (
    id serial,
    first_name varchar(25),
    last_name varchar(50),
    department varchar(50),
    hire_date date,
    starting_salary numeric,
    present_salary numeric
);

INSERT INTO employees (first_name, last_name, department, hire_date, starting_salary, present_salary)
VALUES  ('Paul', 'Johnson', 'Sales', '2011-10-04', 65000, 85000),
        ('Nitin', 'Agarwal', 'Engineering', '2016-01-19', 58000, 66000),
        ('Tom', 'Rock', 'Engineering', '2015-09-15', 48000, 56000),
        ('Arun', 'Patil', 'Design', '2020-06-16', 42000, 44000),
        ('Nathan', 'Drake', 'Development', '2019-10-04', 65000, 66000),
        ('Tony', 'Brown', 'Development', '2021-10-04', 35000, 35000);

 * postgresql://postgres:***@localhost/analysis
Done.
6 rows affected.

[]

%%sql
SELECT id, 
    starting_salary,
    present_salary,
    present_salary - starting_salary AS "Difference"
FROM employees;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

The above table will give us the difference in the present and starting salary of the employees. The AS keyword is used to give a column a more readable alias.

Statistics with SQL

Finding Percentages of the Whole

To figure out the percentage of the whole, divide the number in question by the total, and then multiple by 100.

Tracking Percent Change

The formula to calculate percent change can be expressed like this:
((new number - old number) / old number) * 100

%%sql
SELECT first_name, starting_salary, present_salary, 
    round((present_salary - starting_salary) / present_salary * 100, 1) AS "pct_change"
FROM employees;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

The round() function here removes all but one decimal place. The function takes two arguments: the column or expression to be rounded, and the number of decimal places to display.

Averages and Sums

We can use the avg() and sum() functions to find the averages and sums respectively.

%%sql
SELECT sum(starting_salary) AS "Starting Sum",
    round(avg(present_salary), 0) as "Present Average"
FROM employees;

 * postgresql://postgres:***@localhost/analysis
1 rows affected.

Finding the Median

The median is the middle value in an ordered set of values. Averages are less helpful when the values are bunched, or skewed, toward on end of the distribution, or if the group includes outliers. On such occasions, the median is a better indicator of the data set. A good test is to calculate the average and the median for a group of values. If they're close, the group is probably normally distributed with a bell curve, and the average is useful. If they're apart, the values are not normally distributed and the median is the better representation.

Percentiles
Percentiles indicate the point in an ordered set of data below which a certain percentage of the data is found. The median is equivalent to the 50th percentile. We can use the percentile function to find out the median as well as other quantiles. The percentile_cont(n) function calculates percentiles as continuous values, which means that the result does not have to the one of the numbers in the data set but can be a decimal value in between of the two numbers. The percentile_disc(n) function returns only discrete values, which will return a rounded number from the set.

%%sql
CREATE TABLE percentile_test (
    numbers integer
);

INSERT INTO percentile_test (numbers) 
VALUES (1), (2), (3), (4), (5), (6);

SELECT
    percentile_cont(.5) WITHIN GROUP (ORDER BY numbers),
    percentile_disc(.5) WITHIN GROUP (ORDER BY numbers)
FROM percentile_test;

 * postgresql://postgres:***@localhost/analysis
Done.
6 rows affected.
1 rows affected.

%%sql
SELECT percentile_cont(.5) 
    WITHIN GROUP (ORDER BY present_salary) as "Median Salary"
FROM employees;

 * postgresql://postgres:***@localhost/analysis
1 rows affected.

You can also find out a bunch of other percentiles using an array as follows:

%%sql
SELECT percentile_cont(array[.25,.5,.75]) 
    WITHIN GROUP (ORDER BY numbers)
FROM percentile_test;

 * postgresql://postgres:***@localhost/analysis
1 rows affected.

You can use the unnest() function which makes the array easier to read by turning it into rows.

%%sql
SELECT unnest(
    percentile_cont(array[.25,.5,.75]) 
    WITHIN GROUP (ORDER BY numbers)) AS "quartiles"
FROM percentile_test;

 * postgresql://postgres:***@localhost/analysis
3 rows affected.

Finding the Mode

We can find the mode, which is the value that appears most often, using the mode() function.

%%sql
SELECT mode() 
    WITHIN GROUP (ORDER BY present_salary)
FROM employees;

 * postgresql://postgres:***@localhost/analysis
1 rows affected.

Joining Tables

The JOIN statement links one table to another in the database during a query, using matching values in columns we specify in both tables.

SELECT *
FROM table_a JOIN table_b
ON table_a.key_column = table_b.foreign_key_column

This is similar to earlier SELECT statements, but instead of naming one table in the FROM clause, we name a table, give the JOIN keyword, and then name a second table. The ON keyword follows, where we specify the columns we want to use to match values.
You could also use the ON clause for an expression that evaluates to true or false. For e.g., you could match where values values from one column are greater than or equal to values in another column:> ON table_a.key_column >= table_b.foreign_key_column

%%sql
CREATE TABLE departments (
    dept_id bigserial,
    dept varchar(100),
    city varchar(100),
    CONSTRAINT dept_key PRIMARY KEY (dept_id),
    CONSTRAINT dept_city_unique UNIQUE (dept, city)
);

CREATE TABLE employees (
    emp_id bigserial,
    first_name varchar(100),
    last_name varchar(100),
    salary integer,
    dept_id integer REFERENCES departments (dept_id),
    CONSTRAINT emp_key PRIMARY KEY (emp_id),
    CONSTRAINT emp_dept_unique UNIQUE (emp_id, dept_id)
);

INSERT INTO departments (dept, city)
VALUES
    ('Tax', 'Atlanta'),
    ('IT', 'Boston');

INSERT INTO employees (first_name, last_name, salary, dept_id)
VALUES
    ('Nancy', 'Jones', 62500, 1),
    ('Lee', 'Smith', 59300, 1),
    ('Soo', 'Nguyen', 83000, 2),
    ('Janet', 'King', 95000, 2);

 * postgresql://postgres:***@localhost/analysis
Done.
Done.
2 rows affected.
4 rows affected.

[]

In the departments table, the dept_id column is the table's primary key. A primary key is a column or collection of columns whose values uniquely identify each row in a table. A valid primary key column enforces certain constraints:

The column or collection of columns must have a unique value for each row. The values only need to be unique within a table.
The column or collection of columns can't have missing values.

You define the primary key for departments and employees using the CONSTRAINT keyword.

In the employees table, the emp_id column uniquely identifies each row, thereby making it the primary key. The dept_id column in the employees table refers to values in the departments table's primary key. This is a foreign key, which you add as a constraint when creating a table. A foreign key constraint requires a value entered in a column to already exist in the primary key of the table it references. So, values in dept_id in the employees table must exists in dept_id in the departments table; otherwise you can't add them. Unlike a primary key, a foreign key can be empty, and it can contain duplicate values.

Both tables also include a UNIQUE constraint, which guarantees that values in a column, or a combination of values in more than one column, are unique. In departments, it requires that each row have a unique pair of values for dept and city. In employees, each row must have a unique pair of emp_id and dept_id. You add these constraints to avoid duplicate data.

Querying multiple tables using JOIN

When you join tables in a query, the database connects rows in both tables where the columns you specified for the join have matching values. The query results then include columns from both tables if you requested them as part of the query. You also can use columns from the joined tables to filter results using a WHERE clause.

Let's look at the simple JOIN ... ON syntax first.

%%sql
SELECT *
FROM employees JOIN departments
ON employees.dept_id = departments.dept_id;

 * postgresql://postgres:***@localhost/analysis
4 rows affected.

In the above example, you include an asterisk to choose all columns from both tables. Next, the JOIN keyword goes between the two tables you want data from. Finally, you specify the columns to join the tables using the ON keyword. When you run the query, the results include all values from both tables where values in the dept_id column match.

JOIN Types

%%sql
CREATE TABLE schools_left (
    id integer CONSTRAINT left_id_key PRIMARY KEY,
    left_school varchar(30)
);

CREATE TABLE schools_right (
    id integer CONSTRAINT right_id_key PRIMARY KEY,
    right_school varchar(30)
);

INSERT INTO schools_left (id, left_school) 
VALUES
    (1, 'Oak Street School'),
    (2, 'Roosevelt High School'),
    (5, 'Washington Middle School'),
    (6, 'Jefferson High School');
    
INSERT INTO schools_right (id, right_school) 
VALUES
    (1, 'Oak Street School'),
    (2, 'Roosevelt High School'),
    (3, 'Morrison Elementary'),
    (4, 'Chase Magnet Academy'),
    (6, 'Jefferson High School');

 * postgresql://postgres:***@localhost/analysis
Done.
Done.
4 rows affected.
5 rows affected.

[]

JOIN or INNER JOIN: Returns rows from both tables where matching values are found in the joined columns of both tables.

%%sql
SELECT *
FROM schools_left JOIN schools_right
ON schools_left.id = schools_right.id;

 * postgresql://postgres:***@localhost/analysis
3 rows affected.

LEFT JOIN: Returns every row from the left table plus rows that match values in the joined column from the right table.
The LEFT JOIN and RIGHT JOIN keywords each return all rows from one table and display blank rows from the other table if no matching values are found in the joined columns.

%%sql
SELECT *
FROM schools_left LEFT JOIN schools_right
ON schools_left.id = schools_right.id;

 * postgresql://postgres:***@localhost/analysis
4 rows affected.

RIGHT JOIN: Returns every row from the right table plus rows that match the key values in the key column from the left table.

%%sql
SELECT *
FROM schools_left RIGHT JOIN schools_right
ON schools_left.id = schools_right.id;

 * postgresql://postgres:***@localhost/analysis
5 rows affected.

FULL OUTER JOIN: Returns every row from both tables and matches rows; then joins the rows where values in the joined columns match. If there's no match for a value in either the left or right table, the query result contains an empty row for the other table.

%%sql
SELECT *
FROM schools_left FULL OUTER JOIN schools_right
ON schools_left.id = schools_right.id;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

CROSS JOIN: The query returns each row in the left table with each row in the right table to present all possible combination of rows from both tables.
Definitely avoid this join on large tables.

%%sql
SELECT *
FROM schools_left CROSS JOIN schools_right;

 * postgresql://postgres:***@localhost/analysis
20 rows affected.

Selecting Specific Columns in a JOIN

To select particular columns, you use the SELECT keyword followed by the desired column names. When joining tables, you must include the column as well as its table name.

%%sql
SELECT schools_left.id AS left_id, schools_left.left_school,
    schools_right.right_school
FROM schools_left LEFT JOIN schools_right
ON schools_left.id = schools_right.id;

 * postgresql://postgres:***@localhost/analysis
4 rows affected.

Table Aliases

To create a table alias, we place a character or two after the table name when we declare it in the FROM clause. Those characters then serve as an alias we can use instead of the full table name.

%%sql
SELECT lt.id, lt.left_school, rt.right_school
FROM schools_left AS lt LEFT JOIN schools_right AS rt
ON lt.id = rt.id;

 * postgresql://postgres:***@localhost/analysis
4 rows affected.

Joining Multiple Tables

We can continue adding tables to a query as long as we have columns with matching values to join on.

%%sql
CREATE TABLE schools_enrollment (
    id integer,
    enrollment integer
);

CREATE TABLE schools_grades (
    id integer,
    grades varchar(10)
);

INSERT INTO schools_enrollment (id, enrollment)
VALUES
    (1, 360),
    (2, 1001),
    (5, 450),
    (6, 927);
    
INSERT INTO schools_grades (id, grades)
VALUES
    (1, 'K-3'),
    (2, '9-12'),
    (5, '6-8'),
    (6, '9-12');

 * postgresql://postgres:***@localhost/analysis
Done.
Done.
4 rows affected.
4 rows affected.

[]

%%sql
SELECT lt.id, lt.left_school, en.enrollment, gr.grades
FROM schools_left AS lt LEFT JOIN schools_enrollment AS en
    ON lt.id = en.id
LEFT JOIN schools_grades AS gr
    ON lt.id = gr.id;

 * postgresql://postgres:***@localhost/analysis
4 rows affected.

In the above SELECT query, we join schools_left to schools_enrollment using the tables' id fields. Next, the query joins schools_left to school_grades again on the id fields.

Find Rows with Missing Values with NULL

In SQL, NULL is a special value that represents a condition in which there's no data present or where the data is unknown because it wasn't included. You can use NULL across data types.
You can find empty rows by adding a WHERE clause to filter for NULL by using the phrase IS NULL. If you want to look for columns with data, use IS NOT NULL.

%%sql
SELECT *
FROM schools_left LEFT JOIN schools_right
ON schools_left.id = schools_right.id
WHERE schools_right.id IS NULL;

 * postgresql://postgres:***@localhost/analysis
1 rows affected.

Primary and Foreign Keys

We can declare a primary key using either the column constraint or the table constraint method.

Let's look at an example of a column constraint. The typical syntax is

column_name data_type CONSTRAINT key_name PRIMARY KEY;

%%sql
CREATE TABLE natural_key(
    license_id varchar(10) CONSTRAINT license_key PRIMARY KEY,
    first_name varchar(50),
    last_name varchar(50)
);

Note that in the column constraint syntax, you can omit the CONSTRAINT keyword and the name for the key, and simply use PRIMARY KEY.

Now let's look at an example of a table constraint, where we declare the CONSTRAINT after listing the final column. with the column we want to use as they key in parentheses.

%%sql
CREATE TABLE natural_key(
    license_id varchar(10),
    first_name varchar(50),
    last_name varchar(50),
    CONSTRAINT license_key PRIMARY KEY (license_id)
);

Note that you must use the table constraint syntax when you want to create a primary key using more than one column. In that case, you would list the columns in parentheses, separated by commas.

Composite Primary Key

A composite primary key is when you create a suitable key from a combination of columns.

CREATE TABLE natural_key_composite(
    license_id varchar(10),
    first_name varchar(50),
    last_name varchar(50),
    CONSTRAINT license_key PRIMARY KEY (license_id, last_name)
);

Auto-incrementing Surrogate Key

We've already seen the three serial types: smallserial, serial, and bigserial. An easy way to create a surrogate primary key is with an auto-incrementing integer using one of the serial data types.
PostgreSQL also allows you store and compare UUID values but it does not include functions for generating the UUID values in its core. Instead, it relies on the third-party modules that provide specific algorithms to generate UUIDs.

CREATE TABLE surrogate_key_example (
    order_number bigserial,
    product_name varchar(50),
    order_numberder_date date,
    CONSTRAINT order_key PRIMARY KEY (order_number)
);

Foreign Keys

CREATE TABLE licenses (
    license_id varchar(10),
    first_name varchar(50),
    last_name varchar(50),
    CONSTRAINT licenses_key PRIMARY KEY (license_id)
);

INSERT INTO licenses (license_id, first_name, last_name)
VALUES ('T229901', 'Lynn', 'Malero');

CREATE TABLE registrations (
    registration_id varchar(10),
    registration_date date,
    license_id varchar(10) REFERENCES licenses (license_id),
    CONSTRAINT registration_key PRIMARY KEY (registration_id, license_id)
);

INSERT INTO registrations (registration_id, registration_date, license_id)
VALUES ('A203391', '3/17/2017', 'T229901');

To maintain referential integrity, the foreign key constraint prevents us from deleting a row from the licenses table above before removing any related rows in the registrations table, because doing so would leave an orphaned record. To delete a row in licenses and have that action automatically delete any related rows in registrations, we can specify that behavior by adding ON DELETE CASCADE when defining the foreign key constraint.

CREATE TABLE registrations (
    registration_id varchar(10),
    registration_date date,
    license_id varchar(10) REFERENCES licenses (license_id) ON DELETE CASCADE,
    CONSTRAINT registration_key PRIMARY KEY (registration_id, license_id)
);

With the above query, deleting a row in licenses should also delete all related rows in registrations.

Constraints

The CHECK Constraint

A CHECK constraint evaluates whether data added to a column meets the expected criteria, which we specify with a logical test. We can implement it as a column or a table constraint.
For a column constraint, declare it in the CREATE TABLE statement after the column name and data type: CHECK (logical expression).
As a table constraint, use the syntax CONSTRAINT constraint_name CHECK (logical expression) after all columns are defined.

%%sql
CREATE TABLE check_constraint_example(
    user_id bigserial,
    user_role varchar(50),
    salary integer,
    CONSTRAINT user_id_key PRIMARY KEY (user_id),
    CONSTRAINT check_role_in_list CHECK (user_role IN ('Admin', 'Staff')),
    CONSTRAINT check_salary_not_zero CHECK (salary > 0)
);

 * postgresql://postgres:***@localhost/analysis
Done.

[]

If we use the table constraint syntax, we can also combine more than one test in single CHECK statement.

CONSTRAINT grad_check CHECK (credits >= 120 AND tuition = 'Paid')

CONSTRAINT sale_check CHECK (sale_price < retail_price)

The UNIQUE Constraint

You can ensure that a column has a unique value in each row by using the UNIQUE constraint. The difference between this and a primary key is that a unique constraint permits multiple NULL values in a column.

%%sql
CREATE TABLE unique_constraint_example (
    contact_id bigserial CONSTRAINT contact_id_key PRIMARY KEY,
    first_name varchar(50),
    last_name varchar(50),
    email varchar(200),
    CONSTRAINT email_unique UNIQUE(email)
);

 * postgresql://postgres:***@localhost/analysis
Done.

[]

The NOT NULL Constraint

The NOT NULL constraint prevents a column from accepting empty values.

%%sql
CREATE TABLE not_null_example (
    student_id bigserial,
    first_name varchar(50) NOT NULL,
    last_name varchar(50) NOT NULL,
    CONSTRAINT student_id_key PRIMARY KEY (student_id)
);

 * postgresql://postgres:***@localhost/analysis
Done.

[]

Modifying Constraints

You can remove a constraint or add one later to an existing table using ALTER TABLE, a command which makes changes to tables and columns. You can only add a constraint to an existing table if the data in the target column obeys the limits of the constraint.

To remove a primary key, foreign key, or a UNIQUE constraint:

ALTER TABLE table_name DROP CONSTRAINT constraint_name;

To remove a NOT NULL constraint:> ALTER TABLE table_name ALTER COLUMN column_name DROP NOT NULL; The following statements can be executed one at a time.

%%sql
ALTER TABLE not_null_example DROP CONSTRAINT student_id_key;
ALTER TABLE not_null_example ADD CONSTRAINT student_id_key PRIMARY KEY (student_id);
ALTER TABLE not_null_example ALTER COLUMN first_name DROP NOT NULL;
ALTER TABLE not_null_example ALTER COLUMN first_name SET NOT NULL;

 * postgresql://postgres:***@localhost/analysis
Done.
Done.
Done.
Done.

[]

Counting Rows and Values using count()

The count() aggregate function is used to check the number of tows and perform other counting tasks. If we use an asterisk as an input (count(*)), the function returns the number of table rows regardless of whether they include null values.

%%sql
CREATE TABLE employees (
    id serial,
    first_name varchar(25),
    last_name varchar(50),
    department varchar(50),
    hire_date date,
    starting_salary numeric,
    present_salary numeric
);

INSERT INTO employees (first_name, last_name, department, hire_date, starting_salary, present_salary)
VALUES  ('Paul', 'Johnson', 'Sales', '2011-10-04', 65000, 85000),
        ('Nitin', 'Agarwal', 'Engineering', '2016-01-19', 58000, 66000),
        ('Tom', 'Rock', 'Engineering', '2015-09-15', 48000, 56000),
        ('Arun', 'Patil', 'Design', '2020-06-16', 42000, 44000),
        ('Nathan', 'Drake', 'Development', '2019-10-04', 65000, 66000),
        ('Tony', 'Brown', 'Development', '2021-10-04', 35000, 35000);

 * postgresql://postgres:***@localhost/analysis
Done.
6 rows affected.

[]

%%sql
SELECT count(*)
FROM employees;

 * postgresql://postgres:***@localhost/analysis
1 rows affected.

Counting Values present in a Column

You can find the number of rows in a specific column that contain non-null values by supplying the name of a column as input to the count() function.

%%sql
SELECT count(present_salary)
FROM employees;

 * postgresql://postgres:***@localhost/analysis
1 rows affected.

Counting Distinct Values in a Column

Adding DISTINCT to the count() function causes the function to return a count of distinct values from a column.

%%sql
SELECT count(DISTINCT present_salary)
FROM employees;

 * postgresql://postgres:***@localhost/analysis
1 rows affected.

Finding Maximum and Minimum Values

The max() function returns the largest number in a column. The min() function returns the smallest number in a column.

%%sql
SELECT max(present_salary), min(present_salary)
FROM employees;

 * postgresql://postgres:***@localhost/analysis
1 rows affected.

Aggregating Data using GROUP BY

When you use the GROUP BY clause with aggregate functions, you can group results according to the values in one or more columns. On its own, GROUP BY eliminates duplicate values from the results, similar to DISTINCT.

%%sql
SELECT department
FROM employees
GROUP BY department;

 * postgresql://postgres:***@localhost/analysis
4 rows affected.

The GROUP BY clause follows the FROM clause and includes the You're not limited to grouping just one column.

%%sql
SELECT department, present_salary
FROM employees
GROUP BY department, present_salary
ORDER BY department, present_salary;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

Combining GROUP BY with count()

You can combine GROUP BY with an aggregate function like count() to pull more descriptive information form the data. For e.g., you can get the number of employees in each department. We ask for the values in the department column and a count of those values.

%%sql
SELECT department, count(*)
FROM employees
GROUP BY department
ORDER BY count(*) DESC;

 * postgresql://postgres:***@localhost/analysis
4 rows affected.

Note that when we select individual columns along with an aggregate function, we must include the columns in a GROUP BY clause. You can't group values by aggregating and have ungrouped column values in the same query.

Using GROUP BY on Multiple Columns with count()

We can check the number of employees in each department and the starting salary at each department.

%%sql
SELECT department, starting_salary, count(*)
FROM employees
GROUP BY department, starting_salary
ORDER BY starting_salary ASC, count(*) DESC;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

Filtering an Aggregate using HAVING

To filter results of aggregate functions, we need to use the HAVING clause. Aggregate functions, such as sum(), can't be used within a WHERE clause because they operate at the row level, and aggregate functions work across rows. The HAVING clause places conditions on groups created by aggregating.

%%sql
SELECT department,
    sum(starting_salary) str_sal, sum(present_salary) end_sal,
    round((cast(sum(present_salary) AS decimal(10,2)) - 
    sum(starting_salary)) / sum(present_salary)*100,2) AS pct_change

FROM employees
GROUP BY department, starting_salary
HAVING starting_salary > 50000
ORDER BY pct_change DESC;

 * postgresql://postgres:***@localhost/analysis
3 rows affected.

Checking the Length of a String using length()

length() is a string function that counts the number of characters in a string.
For e.g., you can find the longest last name.

%%sql
SELECT length(last_name) 
FROM employees
ORDER BY length(last_name) DESC;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

String Concatenation

Concatenation combines two or more string or non-string values into one. We use UPDATE in conjunction with the double-pip string operator ||, which performs concatenation.

%%sql
UPDATE employees
SET department = department || ' dept';

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

[]

Modifying Tables, Columns and Data

Modifying Tables with ALTER TABLE

We can use ALTER TABLE to modify the structure of tables.

The code for adding a column:ALTER TABLE table_name ADD COLUMN column_name data_type;> The code to remove a column:ALTER TABLE table_name DROP COLUMN column_name;> The code to change the data type of a column:ALTER TABLE table_name ALTER COLUMN column_name SET DATA TYPE data_type;> The code to add a NOT NULL constraint to a column:ALTER TABLE table_name ALTER COLUMN column_name SET NOT NULL;> The code to remove a NOT NULL constraint:ALTER TABLE table_name ALTER COLUMN column_name DROP NOT NULL;

Modifying Values with UPDATE

The UPDATE statement modifies the data in a column in all rows or in a subset of rows that meet a condition. First we pass UPDATE the name of the table to update, and then pass the SET clause the column that contains the values to change. The new value to place in the column can be a string, number, the name of another column, or even a query or expression that generates a value.

UPDATE table_name SET column_name = value;

We can update values in multiple columns at a time by adding additional columns and source values, and separating each column and value statement with a comma.

UPDATE table_name SET column_a = value_1, column_b = value_2;

We can restrict the update to particular rows by adding a WHERE clause with some criteria that must be met before the update can happen.

UPDATE table_name SET column_a = value WHERE criteria;

We can also update one table with values from another table using a subquery, a query inside a query.

UPDATE table_a SET column = (SELECT column FROM table_b WHERE table_a.column = table_b.column) WHERE EXISTS (SELECT column FROM table_b WHERE table_a.column = table_b.column);

The value portion of the SET clause is a subquery, which is a SELECT statement inside parentheses that generates the value for the update. Similarly, the WHERE EXISTS clause uses a SELECT statement to generate values that serve as the filter for the update. If we didn't use this clause, we might inadvertently set some values to NULL without planning to.

PostgreSQL also offers a simpler syntax using a FROM clause for updating values across tables. UPDATE table_a SET column = table_b.column FROM table_b WHERE table_a.column = table_b.column

Deleting Data

SQL includes options to remove rows and columns from a table along with options to delete an entire table or database. It's easy to exclude unwanted data in queries using a WHERE clause, so decide whether you truly need to delete the data or can just filter it out. Cases where deleting may be the best solution include data with errors or data imported incorrectly.

Deleting Rows from a Table

Using a DELETE FROM statement, we can remove all rows from a table, or we can use a WHERE clause to delete only the portion that matches an expression we supply.

Delete all rows from a table:DELETE FROM table_name;> Delete selected rows:DELETE FROM table_name WHERE expression;

Deleting a Column from a Table

We can remove a column, including all data within the column, from a table by using the DROP keyword in the ALTER TABLE statement.

Remove a column:ALTER TABLE table_name DROP COLUMN column_name

Deleting a Table from a Database

Delete a table:DROP TABLE table_name;

Save or Revert Changes using Transaction Blocks

The only way to undo any changes after you run a DELETE or UPDATE query is to restore from a backup. However, you can check your changes before finalizing them and cancel the change if it's not what you intended. You do this by wrapping the SQL statement within a transaction block, which is a group of statements you define using the following keywords at the beginning and end of the query:

START TRANSACTION signals the start of the transaction block. In PostgreSQL, you can also use the non-ANSI SQL BEGIN keyword.
COMMIT signals the end of the block and saves all changes.
ROLLBACK signals the end of the block and reverts all changes.

Usually, database programmers employ a transaction block in complex database systems to define the start and end of a sequence of operations that perform one unit of work in a database.

Improving Performance when Updating Large Tables

Adding a column to a table and filling it with values can quickly inflate the table's size because the database creates a new version of the existing row each time a value is updated, and it doesn't delete the old row version.
Instead of adding a column and filling it with values, we can save disk space by copying the entire table and adding a populated column during the operation. Then, we rename the tables so the copy replaces the original, and the original becomes a backup.

CREATE TABLE employees_backup AS
SELECT *, '2021-01-12'::date AS end_date
FROM employees;

ALTER TABLE employees 
    RENAME TO employees_temp;

ALTER TABLE employees_backup 
    RENAME TO employees;
    
ALTER TABLE employees_temp
    RENAME TO employees_backup;

Finding Ranks

We'll focus on two rank functions, rank() and dense_rank(), which are both window functions that perform calculations across set of rows we specify using the OVER clause.

The difference between rank() and dense_rank() is the way they handle the next rank value after a tie: rank() includes a gap in the rank order, but dense_rank() does not.

%%sql
SELECT first_name, present_salary,
    rank() OVER (ORDER BY present_salary DESC),
    dense_rank() OVER (ORDER BY present_salary DESC)
FROM employees;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

In my personal opinion, rank() should be used more often since it more accurately reflects the total number of companies ranked.

Ranking within Subgroups with PARTITION BY

A PARTITION BY clause divides table rows according to values in a column we specify.

%%sql
SELECT first_name, department, present_salary, 
    rank() OVER (PARTITION BY department ORDER BY present_salary DESC)
FROM employees;

 * postgresql://postgres:***@localhost/analysis
6 rows affected.

Calculating Rates

Analysts often calculate a rate per 1,000 people, or some multiple of that number, for apples-to-apples comparisons. For e.g., if we're trying to find the rate of crime.

Say x is the number of offenses and p is the total population, then the rate of offenses per 1,000 people is
(x / p) * 1000

select city, st, population, crime,
    round((crime::numeric / population) * 1000, 1)
    as pc_per_1000
from crime_data
where population >= 500000
order by (crime::numeric / population) desc;

timestamp_column	interval_column
2018-12-31 11:30:00+05:30	2 days, 0:00:00
2018-12-31 14:30:00+05:30	30 days, 0:00:00
2018-12-30 19:30:00+05:30	36500 days, 0:00:00
2022-07-12 11:38:50.663655+05:30	7 days, 0:00:00

id	first_name	last_name	department	hire_date	salary
1	Paul	Johnson	Sales	2011-10-04	85000
2	Nitin	Agarwal	Engineering	2016-01-19	65000
3	Tom	Rock	Engineering	2015-09-15	56000
4	Arun	Patil	Design	2020-06-16	44000
5	Nathan	Drake	Development	2019-10-04	68000
6	Tony	Brown	Development	2021-10-04	35000

numeric_column	real_column	double_column
0.70000	0.7	0.7
2.13579	2.13579	2.13579
2.13580	2.1357987	2.1357987654

id	starting_salary	present_salary	Difference
1	65000	85000	20000
2	58000	66000	8000
3	48000	56000	8000
4	42000	44000	2000
5	65000	66000	1000
6	35000	35000	0

emp_id	first_name	last_name	salary	dept_id	dept_id_1	dept	city
1	Nancy	Jones	62500	1	1	Tax	Atlanta
2	Lee	Smith	59300	1	1	Tax	Atlanta
3	Soo	Nguyen	83000	2	2	IT	Boston
4	Janet	King	95000	2	2	IT	Boston

id	left_school	id_1	right_school
1	Oak Street School	1	Oak Street School
2	Roosevelt High School	2	Roosevelt High School
6	Jefferson High School	6	Jefferson High School

id	left_school	enrollment	grades
1	Oak Street School	360	K-3
2	Roosevelt High School	1001	9-12
5	Washington Middle School	450	6-8
6	Jefferson High School	927	9-12

quartiles
2.25
3.5
4.75

length
7
7
5
5
5
4