Why Every Software Engineer Should Learn SQL
Speed up your programs by properly utilising your database
Almost every single software application relies on storing data somewhere accessible. For small scripts and programs, this is often in the computer’s memory, or sometimes just as plain text or binary files on the local disk.
Larger applications, however, will almost certainly use a database that exists externally to the program itself, and the most popular technology of choice for databases these days is the relational database. Most relational databases can be queried using Structured Query Language (SQL), so it’s imperative for a software engineer to have more than just a basic understanding of SQL to write efficient code for interacting with databases.
If you’re thinking “oh, but I already know how to select data from my database, surely it’s not much harder than that”, think again! I’ve seen plenty of code that uses a for loop to accomplish a task that could be achieved much more quickly and simply with a SQL JOIN clause.
This article will use a very basic example to demonstrate just how powerful using SQL properly can be. The code for the article is written in Python using the SQLAlchemy ORM library, but I will include the raw SQL that the program intends to emulate to be more language-agnostic.
While the example and tests will seem simple, this is adapted from real code that I have seen in production, so pay attention!
TL;DR: the SQL JOIN implementation is much faster.
The Example
The example I will use is one of an online store. The database will contain a users table and an orders table, which will store information on the site’s users and any orders placed by those users. Each table has an ID column, and the orders table contains a user_id column with a foreign key that references the users table.
Python
from sqlalchemy import (
    Boolean,
    Column,
    ForeignKey,
    Integer,
    String,
)
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    name = Column(String(50))


class Order(Base):
    __tablename__ = 'orders'

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('users.id'))
    description = Column(String(50))
    payment_status = Column(Boolean, index=True)
SQL
create table users (
    id integer not null,
    name varchar(50),
    primary key (id)
);

create table orders (
    id integer not null,
    user_id integer,
    description varchar(50),
    payment_status boolean,
    primary key (id),
    foreign key (user_id) references users (id)
);

create index ix_orders_payment_status on orders (payment_status);
The Test Setup
As you can see, each order has a payment_status column, which contains a boolean indicating whether the order was paid successfully or not. The aim is to test a function that finds the users who have had at least one failed order and returns a set of User objects describing those users.
The test will be run against a local dockerised MySQL database, created using the following docker-compose.yml file:
services:
  db:
    image: mysql
    environment:
      MYSQL_ROOT_PASSWORD: password
      MYSQL_DATABASE: default
      MYSQL_USER: user
      MYSQL_PASSWORD: password
    ports:
      - '3306:3306'
Once the container is up (e.g. with docker compose up -d), the database will be preloaded with a number of users and orders generated using the Python faker library. The test will be run with every combination of 10, 100 and 1,000 users and orders, and timed over 100 iterations using the Python timeit library.
To avoid inconsistencies caused by database caching, each query will be prefixed with the SQL_NO_CACHE hint.
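For reference, this prefix lands immediately after the SELECT keyword, so the first query issued by the naive implementation comes out roughly like this (a sketch: SQLAlchemy actually spells out the full column list rather than using *):
SQL
select SQL_NO_CACHE *
from orders
where not payment_status;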
There will be two test functions. One will use a naive Python implementation that queries for failed orders, then loops over the results, running a select against the users table on each iteration to build a set of User objects with at least one failed order. The second implementation will use a SQL JOIN and DISTINCT to generate the same set of users in a single query.
The code for the functions can be seen below.
Python
from sqlalchemy.orm import Session

...

class User(Base):
    ...


class Order(Base):
    ...


def get_users_with_failed_orders_python(session: Session) -> set[User]:
    """Get all users with failed orders using a naive Python implementation."""
    failed_orders = (
        session
        .query(Order)
        .prefix_with('SQL_NO_CACHE')
        .filter(Order.payment_status == False)
        .all()
    )

    users = set()
    for order in failed_orders:
        user = (
            session
            .query(User)
            .prefix_with('SQL_NO_CACHE')
            .filter(User.id == order.user_id)
            .one_or_none()
        )
        users.add(user)

    return users


def get_users_with_failed_orders_sql(session: Session) -> set[User]:
    """Get all users with failed orders using a SQL implementation."""
    return set(
        session
        .query(User)
        .prefix_with('SQL_NO_CACHE')
        .distinct()
        .join(Order)
        .filter(Order.payment_status == False)
        .all()
    )
SQL
/* Naive Python implementation */
select *
from orders
where not payment_status;

-- Loop over the results from the previous query
-- and run this select each time
select *
from users
where id = '{{ user_id }}';

/* SQL implementation */
select distinct users.*
from users
inner join orders
    on users.id = orders.user_id
where not orders.payment_status;
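As an aside, the same set of users can also be produced without DISTINCT by asking the question directly with a correlated EXISTS subquery. This is just a sketch of an alternative and isn’t part of the benchmark:
SQL
select users.*
from users
where exists (
    select 1
    from orders
    where orders.user_id = users.id
    and not orders.payment_status
);
Some query planners handle this form better than a JOIN plus DISTINCT, since they can stop scanning a user’s orders as soon as one failed order is found.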
It should be fairly clear from the code that the implementation using the SQL JOIN makes exactly one database query, whereas the naive Python implementation makes one query to fetch the failed orders plus one more per failed order: anywhere between 1 and N + 1 queries, where N is the number of rows in the orders table. This is the classic “N+1 query problem”.
The Results
Unsurprisingly, the pure SQL implementation of the function is much, much faster than the implementation that relies on a Python for loop.
At 10 users and 10 orders, the Python function completed in 5.97ms, while the SQL implementation took only 1.04ms. As the number of rows in the orders table increased, so did the runtimes, but things started looking a lot worse for Python.
At 10 users and 1,000 orders, the SQL implementation was still sitting pretty at 1.28ms, while the Python implementation took 458.49ms per test! This is, of course, because the SQL implementation made only one database call, whereas the Python implementation made roughly 500 (faker marks about half of the orders as failed, so around 500 of the 1,000 orders each triggered a follow-up query).
Interestingly, the number of rows in the users table had very little impact on the runtime of the Python implementation, but had a larger impact on the SQL implementation. This is likely due to the MySQL server having to do more work to find DISTINCT users in a larger list, whereas the Python version built the set object user by user.
The full results can be seen below, along with some charts showing how the performance of each function scaled with the number of rows in the database.
Benchmark output
n_users: 10, n_orders: 10
Python: 5.97ms, SQL: 1.04ms

n_users: 10, n_orders: 100
Python: 46.22ms, SQL: 1.22ms

n_users: 10, n_orders: 1,000
Python: 458.49ms, SQL: 1.28ms

n_users: 100, n_orders: 10
Python: 4.85ms, SQL: 0.96ms

n_users: 100, n_orders: 100
Python: 48.08ms, SQL: 1.40ms

n_users: 100, n_orders: 1,000
Python: 441.89ms, SQL: 2.53ms

n_users: 1,000, n_orders: 10
Python: 4.97ms, SQL: 1.04ms

n_users: 1,000, n_orders: 100
Python: 46.54ms, SQL: 1.57ms

n_users: 1,000, n_orders: 1,000
Python: 451.37ms, SQL: 5.75ms
Graphs
[Four charts by the author: Python query against n_users, Python query against n_orders, SQL query against n_users, and SQL query against n_orders.]
Conclusion
While both implementations above could certainly be optimised to squeeze out slightly more performance, the SQL implementation will win every single time. The best part is, the SQL used is actually very simple, and anyone can learn it!
There are plenty of free resources online for learning SQL, so if you’re a software engineer or an aspiring one, get out there and get studying! The benefits will be enormous.
If you enjoyed this article, you might also enjoy my article about my Python library, quiffen, which is used for parsing financial information from QIF files.
I can also be found on my blog, on Twitter and on LinkedIn, so if you have any questions, feel free to get chatting!
- Isaac
P.S., I’ve included the full code for the benchmarks I ran below, so have a go and play around with it all yourself!
import timeit

import faker
import matplotlib.pyplot as plt
import pandas as pd
from cycler import cycler
from sqlalchemy import (
    Boolean,
    Column,
    ForeignKey,
    Integer,
    String,
    create_engine,
)
from sqlalchemy.orm import Session, declarative_base, sessionmaker

Base = declarative_base()
plt.rcParams['axes.prop_cycle'] = cycler('color', plt.get_cmap('tab20').colors)

NUM_REPEATS = 100


class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    name = Column(String(50))


class Order(Base):
    __tablename__ = 'orders'

    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('users.id'))
    description = Column(String(50))
    payment_status = Column(Boolean, index=True)


def add_data(session: Session, n_users: int, n_orders: int):
    """Populate the tables with fake users and orders."""
    f = faker.Faker()
    users = [User(name=f.name()) for _ in range(n_users)]
    session.add_all(users)
    session.commit()

    orders = [
        Order(
            user_id=f.random_int(1, n_users),
            description=f.text(50),
            payment_status=f.boolean(),
        )
        for _ in range(n_orders)
    ]
    session.add_all(orders)
    session.commit()


def empty_tables(session: Session):
    session.query(Order).delete()
    session.query(User).delete()
    session.commit()


def setup_db(session: Session):
    Base.metadata.create_all(session.bind)


def teardown_db(session: Session):
    Base.metadata.drop_all(session.bind)


def get_sessionmaker() -> sessionmaker:
    engine = create_engine('mysql+pymysql://user:password@localhost/default')
    return sessionmaker(bind=engine)


def get_users_with_failed_orders_python(session: Session) -> set[User]:
    """Get all users with failed orders using a naive Python implementation."""
    failed_orders = (
        session
        .query(Order)
        .prefix_with('SQL_NO_CACHE')
        .filter(Order.payment_status == False)
        .all()
    )

    users = set()
    for order in failed_orders:
        user = (
            session
            .query(User)
            .prefix_with('SQL_NO_CACHE')
            .filter(User.id == order.user_id)
            .one_or_none()
        )
        users.add(user)

    return users


def get_users_with_failed_orders_sql(session: Session) -> set[User]:
    """Get all users with failed orders using a SQL implementation."""
    return set(
        session
        .query(User)
        .prefix_with('SQL_NO_CACHE')
        .distinct()
        .join(Order)
        .filter(Order.payment_status == False)
        .all()
    )


def run_benchmark(
    session: Session,
    n_users: int,
    n_orders: int,
) -> tuple[float, float]:
    """Time both implementations, returning the mean runtime of each in ms."""
    try:
        setup_db(session)
        add_data(session, n_users, n_orders)
        python = timeit.timeit(
            lambda: get_users_with_failed_orders_python(session),
            number=NUM_REPEATS,
        )
        sql = timeit.timeit(
            lambda: get_users_with_failed_orders_sql(session),
            number=NUM_REPEATS,
        )
        return python / NUM_REPEATS * 1000, sql / NUM_REPEATS * 1000
    finally:
        empty_tables(session)
        teardown_db(session)


def save_plot(
    df: pd.DataFrame,
    x: str,
    y: str,
    group: str,
    title: str,
    filename: str,
):
    plt.set_cmap('tab10')
    fig, ax = plt.subplots()
    df = df.set_index(x)
    df.groupby(group)[y].plot(
        title=title,
        ax=ax,
    )
    ax.set_ylabel('Time (ms)')
    ax.legend(title=group)
    fig.savefig(filename)


def main():
    Session = get_sessionmaker()
    results = []
    with Session() as session:
        # Drop any leftover tables from a previous run before starting
        teardown_db(session)
        print('Starting benchmark...\n')
        nums = (10, 100, 1000)
        for n_users in nums:
            for n_orders in nums:
                print(f'n_users: {n_users:,}, n_orders: {n_orders:,}')
                python, sql = run_benchmark(session, n_users, n_orders)
                print(f'Python: {python:.2f}ms, SQL: {sql:.2f}ms\n')
                results.append((n_users, n_orders, python, sql))

    df = pd.DataFrame(
        results,
        columns=['n_users', 'n_orders', 'python', 'sql'],
    )
    df.to_csv('results.csv', index=False)

    save_plot(
        df,
        'n_users',
        'python',
        'n_orders',
        'Python query against n_users',
        'python_n_users.png',
    )
    save_plot(
        df,
        'n_orders',
        'python',
        'n_users',
        'Python query against n_orders',
        'python_n_orders.png',
    )
    save_plot(
        df,
        'n_users',
        'sql',
        'n_orders',
        'SQL query against n_users',
        'sql_n_users.png',
    )
    save_plot(
        df,
        'n_orders',
        'sql',
        'n_users',
        'SQL query against n_orders',
        'sql_n_orders.png',
    )


if __name__ == '__main__':
    main()