Introduction
Structured Query Language (SQL) is a standard computer language for relational database management and data manipulation. It can be used to read, write, and update data. For data scientists, it is mainly used for data retrieval. There are several relational database management systems, such as Oracle, MySQL, and SQLite. Depending on the system you are using, the SQL syntax may be a little bit different.
Retrieve data
Use SELECT statement to query database, filtered by some conditions (optional). The [ ] is used to state optional clauses. To select all columns, use * after the SELECT keyword.
SELECT <col1>, <col2>
FROM <tablename>
[WHERE <condition>];
Conditional selections used in the WHERE clause:
= Equal
<> Not equal to
> Greater than
< Less than
>= Greater than or equal
<= Less than or equal
BETWEEN AND Within a range
IS NULL Is a null
LIKE Pattern matching operator
IN (<val1>, <val2>, ...)
OR
AND
NOT
Sometimes IN and OR can accomplish the same thing. But in general, IN executes faster than OR and it allows to use another SELECT for subqueries.
SQL processes AND before OR. Use ( ) when using AND and OR together to avoid confusion in the order of operations.
Wildcards for pattern matching:
Using % for pattern matching: ‘%a’, end with a; ‘a%’, start with a; ‘%a%’, with a in between; ‘a%z’, start with a and end with z; ‘a%@gmail.com’, grab gmail addresses start with a.
Using underscore _ for matching a single character: ‘_day’, grab four-letter words that end with ‘day’.
Wildcards take longer to run than regular operators. Avoid using them if they can be replaced by other operators.
Example
The code below select two columns from a table named nameInfo and only select rows where the lastname ends in “son”.
SELECT firstname, lastname
FROM nameInfo
WHERE lastname LIKE '%son';
Create table
Use CREATE TABLE statement to create a new table, with optional constraints associated with each column respectively. The table and column names must start with a letter and can be followed by letters, numbers, or underscores The maximum length of a name is 30 characters.
CREATE TABLE <tablename>
(<column1> <data type> [constraint],
<column2> <data type> [constraint]
);
Use CREATE TEMPORARY TABLE statement to create a temporary table. It is faster to create a temporary table than a real table. However, temporary tables will be deleted when current session is terminated.
CREATE TEMPORARY TABLE <temp_tablename> AS
(
SELECT *
FROM <tablename>
[WHERE <conditions>]
)
Common data types in SQL
char(size) Character string with a fixed length.
Varchar(size) Character string with a maximum number of length.
date Date value
number(size) Number value with a maximum number of integer digits.
number(size, d) Number value with a maximum number of "size" integer digits and a maximum number of "d" fraction digits.
decimal(n, d)
Common constraints
UNIQUE No two records can have the same value in a particular column.
NOT NULL A column can't be left blank.
PRIMARY KEY Defines a unique identification of each record (or row) in a table. Cannot contain NULL.
Example
CREATE TABLE nameInfo
(idnumber number(12) UNIQUE,
firstname Varchar(20),
lastname Varchar(20),
age number(3),
country Varchar(10)
);
Insert and delete rows
Use INSERT INTO statement to insert rows into a table. All strings should be enclosed in single quotes, for example ‘September’.
INSERT INTO <tablename>
(col_1, ... , col_n)
VALUES (value_1, ..., value_n);
Use DELETE FROM statement to delete rows from a table.
DELETE FROM <tablename>
WHERE <col_1> OPERATOR <value_1>
[AND|OR <col_2> OPERATOR <value_2>];
Examples
INSERT INTO nameInfo
(idnumber, firstname, lastname, age, country)
VALUES (1234567, 'John', 'Lee', 45, 'Canada');
DELETE FROM nameInfo
WHERE lastname = 'May';
Insert column
Add column to a table in a database:
AFTER TABLE <tablename>
ADD <columnname> <datatype>;
SELECT <col1>, <col2>, <col1> + <col2> AS <col3>
Update records
Use UPDATE statement to update records that match a specified criteria.
UPDATE <tablename>
SET <columnname> = <newvalue> [, <nextcolumn> = <newvalue2>, ... ]
WHERE <columnname> OPERATOR <value>
[AND|OR <column> OPERATOR <value>];
Example
UPDATE nameInfo
SET lastname = 'Li', age = 48,
WHERE lastname = 'Lee' AND firstname = 'John';
Drop a table
Use DROP TABLE command to delete a table and all rows in the table. It is different from deleting all the records in the table. The former removes the table definition as well as all of its rows and the latter leaves the table including column and constraint information.
DROP TABLE <tablename>
Sort a table
Use ORDER BY statement to sort a table. It must be the last clause in a select statement.
SELECT <col1>, <col2>
FROM <tablename>
ORDER BY <col1> DESC, <col2> ASC, <col3> ASC;
You can sort by a column not retrieved and use DESC or ASC for descending or ascending (default) order.
Aggregate functions
AVG()
COUNT()
MIN()
MAX()
SUM()
Rows containing NULL values are ignored by AVG(), MIN(), MAX().
COUNT(*)
: count all the rows containing values or NULL;COUNT(<colname>)
: count all the rows in a specific column ignoring NULL values;Use DISTINCT on aggregate functions. For example,
COUNT( DISTINCT <colname>)
.
Examples
SELECT AVG(score) AS average_score
Group data
Use GROUP BY clause to group data and use HAVING clause for filtering for groups. They are used after WHERE clause and hence rows filtered out by WHERE will not be included in GROUP BY clause.
SELECT <col1>, <col2>, [some aggregated calculations]
FROM <tablename>
GROUP BY <col1>, <col2>
[HAVING <condition>];
Every column in your SELECT statement must be present in a GROUP BY clause, except for aggregated calculations.
Example
SELECT customerID, COUNT(*) AS orders
FROM Orders
GROUP BY customerID
HAVING COUNT(*) >= 2 ;
Using subqueries
Subqueries are queries embedded into other queries. Since data is often stored in multiple tables, subqueries are useful when it comes to getting information from multiple tables. Moreover, they are often used for adding additional criteria, such as filtering criteria, from another table into your query. Subquery selects can only retrieve a single column.
There is no limit to the number of subqueries you can have. But the performance in obtaining the results slows down when the subqueries are deeply nested.
Example
SELECT studentID, firstname, lastname, major
FROM studentList
WHERE studentiID IN (
SELECT studentID
FROM gradeList
WHERE score > 80
);
Joining tables
A join allows you to retrieve the data from multiple tables in just one query. The result only exists for the duration of the query execution.
Cartesian (cross) joins
A Cartesian join allows you to take each record from the first table and match it with all of the records from the second table. If the first table contains x
rows and second table contains y
rows, you will have x*y
rows in the end result. It is computationally taxing and is not matching on anything. So it is not frequently used.
SELECT <col1>, <col2>, <col3>
FROM <table1> CROSS JOIN <table2>;
Inner joins
An inner join is used to select records that have matching values in both tables on some keys.
SELECT <col1>, <col2>, <col3>
FROM <table1> INNER JOIN <table2>
ON <prequalified key1> = <prequalified key2> ;
Example
SELECT studentID, firstname, lastname, score
FROM studentList INNER JOIN gradeList
ON studentList.studentID = gradeList.studentID;
Aliases and self joins
An alias is helpful because it can help you by just shortening names and simplifying how we are pre-qualifying them. It doesn’t rewrite anything of the table permanently and is only stored for the duration of the query.
Example
SELECT studentID, firstname, lastname
FROM studentList AS stud, gradeList AS grade
WHERE stud.studentID = grade.studentID ;
A self join takes the table and treat it like two separate tables. The following example matches customers that are from the same city.
Example
SELECT A.name AS name1, B.name AS name2, A.city
FROM Customers A, Customers B
WHERE A.customerID = B.customberID
AND A.city = B.city
ORDER BY A.city;
Left, right, full outer joins
The left join returns all the records from the table on the left side and the matched records from the right table. The following example selects all customers and any orders they might have:
SELECT C.name, O.orderID
FROM Customers C
LEFT JOIN Orders O
ON C.customerID = O.customerID
ORDER BY C.name
The difference between the right and left joins is the order the tables are relating. Right joins can be turned into left joins by reversing the order of the tables.
The full outer join returns all records when there is a match in either left or right table. The following example selects all customers and all orders.
SELECT C.name, O.orderID
FROM Customers C
FULL OUTER JOIN Orders O
ON C.customerID = O.customerID
ORDER BY C.name ;
Unions
The UNION is used to combine the result set of two or more SELECT statements. Each SELECT statement within UNION must have the same number of columns and the columns must be in the same order with similar data types.
SELECT <colnames> FROM <table1>
UNION
SELECT <colnames> FROM <table2>
The following example selects the German cities that have suppliers.
SELECT city, country FROM Customers
WHERE country = 'germany'
UNION
SELECT city, country FROM supplier
WHERE country = 'germany'
ORDER BY city;
Working with text strings
Concatenations
Use ||
to concatenate strings. SQL server supports +
instead of ||
.
SELECT CompanyName, ContactName, CompanyName || '(' || ContactName || ')'
FROM customers
Trimming
Use LTRIM, RTRIM, TRIM to trim the leading, trailing, or both space from a string.
SELECT TRIM(" Hello. ") AS TrimmedString
Substring
Use SUBSTR to return the specified number of characters from a particular position of a given string.
SELECT firstname, SUBSTR (firstname, 2, 4) # pull 4 characters, starting at the 2nd character
FROM employees
Upper and lower
SELECT UPPER(firstname) FROM employees;
SELECT LOWER(firstname) FROM employees;
Working with date and time strings
SQLite supports 5 date and time functions:
DATE(timestring, modifier, modifier, ...)
TIME(timestring, modifier, modifier, ...)
DATETIME(timestring, modifier, modifier, ...)
JULIANDAY(timestring, modifier, modifier, ...)
# Extract certain parts of a date or time string.
# format components: %Y %m %d %H %M %S %s
STRFTIME(format, timestring, modifier, modifier, ...)
A timestring can be in any of the following formats
YYYY-MM-DD
YYYY-MM-DD HH:MM
YYYY-MM-DD HH:MM:SS
YYYY-MM-DD HH:MM:SS.SSS
YYYY-MM-DDTHH:MM
YYYY-MM-DDTHH:MM:SS
YYYY-MM-DDTHH:MM:SS.SSS
HH:MM
HH:MM:SS
HH:MM:SS.SSS
Examples
SELECT DATE('now') # compute current date
SELECT STRFTIME('%Y %m %d', 'now')
SELECT Birthdate,
STRFTIME('%Y', Birthdate) AS Year,
STRFTIME('%m', Birthdate) AS Month,
STRFTIME('%d', Birthdate) AS Day,
DATE(('now') - Birthdate) AS Age
FROM employees
Case statements
A case statement mimics if-else statement found in most programming languages. It can be used in SELECT, INSERT, UPDATE, and DELETE statements.
CASE [input_var_name]
WHEN C1 THEN E1
WHEN C2 THEN E2
...
[ELSE else_result]
END new_var_name
Example
SELECT studentid, firstname, lastname, score,
CASE
WHEN score >= 60 THEN 'pass'
WHEN score < 60 THEN 'fail'
ELSE 'other'
END score_category
FROM exam_result
Views
A view is essentially a stored query. It will be removed after database connect has ended.
Benefits:
It can add or remove columns without changing the schema.
It provides a simpler option to creating a new table.
It helps us clean up our queries and simplify the queries when we have to write
It can be used to encapsulate complex queries or calculations that you are trying to write.
CREATE [TEMP] VIEW [IF NOT EXISTS] view_name AS
SELECT <list of column names>
FROM datatable
# To see the view
SELECT *
FROM view_name
# Remove the view
DROP VIEW view_name