SQL notes

Introduction

Structured Query Language (SQL) is a standard computer language for relational database management and data manipulation. It can be used to read, write, and update data. For data scientists, it is mainly used for data retrieval. There are several relational database management systems, such as Oracle, MySQL, and SQLite. Depending on the system you are using, the SQL syntax may be a little bit different.

Retrieve data

Use SELECT statement to query database, filtered by some conditions (optional). The [ ] is used to state optional clauses. To select all columns, use * after the SELECT keyword.

SELECT <col1>, <col2> 
FROM <tablename>
[WHERE <condition>];

Conditional selections used in the WHERE clause:

=              Equal
<>             Not equal to
>              Greater than
<              Less than
>=             Greater than or equal
<=             Less than or equal
BETWEEN AND    Within a range
IS NULL        Is a null
LIKE           Pattern matching operator
IN (<val1>, <val2>, ...)           
OR
AND
NOT
  • Sometimes IN and OR can accomplish the same thing. But in general, IN executes faster than OR and it allows to use another SELECT for subqueries.

  • SQL processes AND before OR. Use ( ) when using AND and OR together to avoid confusion in the order of operations.

Wildcards for pattern matching:

  • Using % for pattern matching: ‘%a’, end with a; ‘a%’, start with a; ‘%a%’, with a in between; ‘a%z’, start with a and end with z; ‘a%@gmail.com’, grab gmail addresses start with a.

  • Using underscore _ for matching a single character: ‘_day’, grab four-letter words that end with ‘day’.

  • Wildcards take longer to run than regular operators. Avoid using them if they can be replaced by other operators.

Example

The code below select two columns from a table named nameInfo and only select rows where the lastname ends in “son”.

SELECT firstname, lastname
FROM nameInfo
WHERE lastname LIKE '%son';

Create table

Use CREATE TABLE statement to create a new table, with optional constraints associated with each column respectively. The table and column names must start with a letter and can be followed by letters, numbers, or underscores The maximum length of a name is 30 characters.

CREATE TABLE <tablename>
(<column1> <data type> [constraint],
 <column2> <data type> [constraint]
);

Use CREATE TEMPORARY TABLE statement to create a temporary table. It is faster to create a temporary table than a real table. However, temporary tables will be deleted when current session is terminated.

CREATE TEMPORARY TABLE <temp_tablename> AS
(
  SELECT *
  FROM <tablename>
  [WHERE <conditions>]
)

Common data types in SQL

char(size)			Character string with a fixed length.
Varchar(size)		Character string with a maximum number of length.
date				Date value
number(size)		Number value with a maximum number of integer digits.
number(size, d)		Number value with a maximum number of "size" integer digits and a maximum number of "d" fraction digits.
decimal(n, d)

Common constraints

UNIQUE			No two records can have the same value in a particular column.
NOT NULL		A column can't be left blank.
PRIMARY KEY 	Defines a unique identification of each record (or row) in a table. Cannot contain NULL.

Example

CREATE TABLE nameInfo
(idnumber number(12) UNIQUE,
 firstname Varchar(20),
 lastname Varchar(20),
 age number(3),
 country Varchar(10)
);

Insert and delete rows

Use INSERT INTO statement to insert rows into a table. All strings should be enclosed in single quotes, for example ‘September’.

INSERT INTO <tablename>
       (col_1, ... , col_n)
VALUES (value_1, ..., value_n);

Use DELETE FROM statement to delete rows from a table.

DELETE FROM <tablename>
WHERE <col_1> OPERATOR <value_1>
[AND|OR <col_2> OPERATOR <value_2>];

Examples

INSERT INTO nameInfo
  (idnumber, firstname, lastname, age, country)
VALUES (1234567, 'John', 'Lee', 45, 'Canada');

DELETE FROM nameInfo
WHERE lastname = 'May';

Insert column

Add column to a table in a database:

AFTER TABLE <tablename>
ADD <columnname> <datatype>;
SELECT <col1>, <col2>, <col1> + <col2> AS <col3>

Update records

Use UPDATE statement to update records that match a specified criteria.

UPDATE <tablename>
SET <columnname> = <newvalue> [, <nextcolumn> = <newvalue2>, ... ]
WHERE <columnname> OPERATOR <value> 
[AND|OR <column> OPERATOR <value>];

Example

UPDATE nameInfo
SET lastname = 'Li', age = 48,
WHERE lastname = 'Lee' AND firstname = 'John';

Drop a table

Use DROP TABLE command to delete a table and all rows in the table. It is different from deleting all the records in the table. The former removes the table definition as well as all of its rows and the latter leaves the table including column and constraint information.

DROP TABLE <tablename>

Sort a table

Use ORDER BY statement to sort a table. It must be the last clause in a select statement.

SELECT <col1>, <col2>
FROM <tablename>
ORDER BY <col1> DESC, <col2> ASC, <col3> ASC;

You can sort by a column not retrieved and use DESC or ASC for descending or ascending (default) order.

Aggregate functions

AVG()
COUNT()
MIN()
MAX()
SUM()

  • Rows containing NULL values are ignored by AVG(), MIN(), MAX().

  • COUNT(*): count all the rows containing values or NULL;

  • COUNT(<colname>): count all the rows in a specific column ignoring NULL values;

  • Use DISTINCT on aggregate functions. For example, COUNT( DISTINCT <colname>).

Examples

SELECT AVG(score) AS average_score

Group data

Use GROUP BY clause to group data and use HAVING clause for filtering for groups. They are used after WHERE clause and hence rows filtered out by WHERE will not be included in GROUP BY clause.

SELECT <col1>, <col2>, [some aggregated calculations]
FROM <tablename>
GROUP BY <col1>, <col2>
[HAVING <condition>]; 

Every column in your SELECT statement must be present in a GROUP BY clause, except for aggregated calculations.

Example

SELECT customerID, COUNT(*) AS orders
FROM Orders
GROUP BY customerID
HAVING COUNT(*) >= 2 ; 

Using subqueries

Subqueries are queries embedded into other queries. Since data is often stored in multiple tables, subqueries are useful when it comes to getting information from multiple tables. Moreover, they are often used for adding additional criteria, such as filtering criteria, from another table into your query. Subquery selects can only retrieve a single column.

There is no limit to the number of subqueries you can have. But the performance in obtaining the results slows down when the subqueries are deeply nested.

Example

SELECT studentID, firstname, lastname, major
FROM studentList
WHERE studentiID IN (
  SELECT studentID
  FROM gradeList
  WHERE score > 80
);

Joining tables

A join allows you to retrieve the data from multiple tables in just one query. The result only exists for the duration of the query execution.

Cartesian (cross) joins

A Cartesian join allows you to take each record from the first table and match it with all of the records from the second table. If the first table contains x rows and second table contains y rows, you will have x*y rows in the end result. It is computationally taxing and is not matching on anything. So it is not frequently used.

SELECT <col1>, <col2>, <col3>
FROM  <table1>  CROSS JOIN <table2>;

Inner joins

An inner join is used to select records that have matching values in both tables on some keys.

SELECT <col1>, <col2>, <col3>
FROM <table1>  INNER JOIN <table2> 
ON <prequalified key1> = <prequalified key2> ;

Example

SELECT studentID, firstname, lastname, score
FROM studentList INNER JOIN gradeList
ON studentList.studentID = gradeList.studentID;

Aliases and self joins

An alias is helpful because it can help you by just shortening names and simplifying how we are pre-qualifying them. It doesn’t rewrite anything of the table permanently and is only stored for the duration of the query.

Example

SELECT studentID, firstname, lastname
FROM studentList AS stud, gradeList AS grade
WHERE stud.studentID = grade.studentID ;

A self join takes the table and treat it like two separate tables. The following example matches customers that are from the same city.

Example

SELECT A.name AS name1, B.name AS name2, A.city
FROM Customers A, Customers B
WHERE A.customerID = B.customberID
AND A.city = B.city
ORDER BY A.city;

Left, right, full outer joins

The left join returns all the records from the table on the left side and the matched records from the right table. The following example selects all customers and any orders they might have:

SELECT C.name, O.orderID
FROM Customers C
LEFT JOIN Orders O 
ON C.customerID = O.customerID
ORDER BY C.name

The difference between the right and left joins is the order the tables are relating. Right joins can be turned into left joins by reversing the order of the tables.

The full outer join returns all records when there is a match in either left or right table. The following example selects all customers and all orders.

SELECT C.name, O.orderID
FROM Customers C
FULL OUTER JOIN Orders O 
ON C.customerID = O.customerID
ORDER BY C.name ;

Unions

The UNION is used to combine the result set of two or more SELECT statements. Each SELECT statement within UNION must have the same number of columns and the columns must be in the same order with similar data types.

SELECT <colnames> FROM <table1>
UNION
SELECT <colnames> FROM <table2>

The following example selects the German cities that have suppliers.

SELECT city, country FROM Customers
WHERE country = 'germany'
UNION
SELECT city, country FROM supplier
WHERE country = 'germany'
ORDER BY city;

Working with text strings

Concatenations

Use || to concatenate strings. SQL server supports + instead of ||.

SELECT CompanyName, ContactName, CompanyName || '(' || ContactName || ')'
FROM customers

Trimming

Use LTRIM, RTRIM, TRIM to trim the leading, trailing, or both space from a string.

SELECT TRIM("	Hello.	") AS TrimmedString

Substring

Use SUBSTR to return the specified number of characters from a particular position of a given string.

SELECT firstname, SUBSTR (firstname, 2, 4)	# pull 4 characters, starting at the 2nd character 
FROM employees

Upper and lower

SELECT UPPER(firstname) FROM employees;
SELECT LOWER(firstname) FROM employees;

Working with date and time strings

SQLite supports 5 date and time functions:

DATE(timestring, modifier, modifier, ...)

TIME(timestring, modifier, modifier, ...)

DATETIME(timestring, modifier, modifier, ...)

JULIANDAY(timestring, modifier, modifier, ...)

# Extract certain parts of a date or time string.
# format components: %Y %m %d %H %M %S %s
STRFTIME(format, timestring, modifier, modifier, ...)

A timestring can be in any of the following formats

YYYY-MM-DD
YYYY-MM-DD HH:MM
YYYY-MM-DD HH:MM:SS
YYYY-MM-DD HH:MM:SS.SSS
YYYY-MM-DDTHH:MM
YYYY-MM-DDTHH:MM:SS
YYYY-MM-DDTHH:MM:SS.SSS
HH:MM
HH:MM:SS
HH:MM:SS.SSS

Examples

SELECT DATE('now')	# compute current date
SELECT STRFTIME('%Y %m %d', 'now')

SELECT Birthdate, 
	   STRFTIME('%Y', Birthdate) AS Year,
	   STRFTIME('%m', Birthdate) AS Month,
	   STRFTIME('%d', Birthdate) AS Day,
	   DATE(('now') - Birthdate) AS Age
FROM employees

Case statements

A case statement mimics if-else statement found in most programming languages. It can be used in SELECT, INSERT, UPDATE, and DELETE statements.

CASE [input_var_name]
WHEN C1 THEN E1
WHEN C2 THEN E2
...
[ELSE else_result]
END new_var_name

Example

SELECT studentid, firstname, lastname, score, 
	   CASE 
			WHEN score >= 60 THEN 'pass'
			WHEN score < 60 THEN 'fail'
			ELSE 'other'
	   END score_category
FROM exam_result

Views

A view is essentially a stored query. It will be removed after database connect has ended.

Benefits:

  • It can add or remove columns without changing the schema.

  • It provides a simpler option to creating a new table.

  • It helps us clean up our queries and simplify the queries when we have to write

  • It can be used to encapsulate complex queries or calculations that you are trying to write.

CREATE [TEMP] VIEW [IF NOT EXISTS] view_name AS
SELECT <list of column names>
FROM datatable

# To see the view
SELECT *
FROM view_name

# Remove the view
DROP VIEW view_name
Avatar
Tingting Yu
Developer, Data Scientist

My research interests include time-series analysis, longitudinal analysis, image analysis …