Take your SQL from Good to Great: Part 1
WITH CTEs, you can go far.
Last week I saw a
that got me thinking:
what are the SQL ‘hacks’ I would go back and teach myself sooner if I could?
This simple question quickly turned into a series of articles that expounds on some of my favorite methods explaining both why they’re useful, and how to use them. The topics I’ve selected are:
Part 1: Common Table Expressions
Part 2: All about those dates (coming soon!)
Part 3: The other JOINs (coming soon!)
Part 4: Window Functions (coming soon!)
Have a suggestion for an under-appreciated SQL method that’s made the difference for you? Drop me a comment!
Ok, on to Part 1: The CTE!
What’s a CTE?
Common Table Expressions (CTEs) are temporary tables that are available in a single query.
They use the following syntax:
WITH cte_name AS
(SELECT ... FROM ... )
SELECT * FROM cte_name;
Why they matter
To show why this seemingly benign query construction is so beneficial, let’s look at an example. Say I have two tables:
matches: a large file of tennis match statistics
players: a file of tennis player metadata
I want to find how old each player was when they won their
To do this, I’ll need to find the
time each player won a Grand Slam from the
table, then use the
table to compare that date to their birthdate.
approaches to this problem, but they largely boil down to two approaches:
While both approaches are similar in length, and the logic applied is identical, there are a few distinct advantages to the CTE approach:
1. It‘s easier to understand.
If Step 1 of answering this question involves getting the first time each player won a Grand Slam, shouldn’t that be the first thing you read?
With subqueries, the order of execution is nearly impossible to discern as you’re forced to scan for the lowest level indentation and work your way up and around until you finally reach the first line.
CTEs let you construct your query in a way that simply makes sense, benefiting not just you as you write the query, but also anyone else forced to try to interpret it.
2. Faster iterations.
What if I wanted to know the youngest players to get to a Grand Slam Final but lose? Or the oldest players to win a final in
tournaments, not just Grand Slams?
I can easily adjust my CTE
to answer these questions, without having to worry if I’m making the changes in the right subquery.
So much of data analysis involves these quick iterative tweaks to our queries as we have ‘conversations’ with the data. These “tweaks” can swell to painful tasks when we have complex queries; CTEs give you a common-sense structure to get these quick tweaks done…quickly.
3. Trustworthy Validations
Like every good analyst, I will
validate my results (😉), and CTEs make that far more straightforward. Since I can check each individual CTE I can quickly identify the source of any validation errors and troubleshoot a single logical step instead of dealing with a tangled mess of subqueries.
I think that’s 40–love to CTEs if you’re keeping score.
CTE power-up 👾
The benefits of CTEs at an individual query level are numerous, but what if we could take the benefits of the CTE and apply it across our
analysis, not just one query?
With the recent advancements in
, we can start to see what’s possible when we apply the CTE construct to a larger scale.
In these notebooks, each cell represents a CTE, and each of these cells can be referenced by any other cell, in effect, creating an entire connected graph of CTEs.
Our analysis done in Count.co
Our analysis represented as a connected graph. Image by Author.
What this enables you to do can be surprising. You still enjoy the benefits of a single CTE such as building up your analysis in a logical flow, making quick iterations and validation checks, but in addition, it allows for parameterization of queries and text.
The cumulative effect is something more like an app than a query.
If we wanted to expand our analysis to include a few more queries, we can chain more cells together. And if we wanted to include global filters, we can add those to the notebook and link those in like any other cell.
You can see the notebook for the above connected graph here:
And you can learn more about SQL notebooks
CTE best practices
Use unique and meaningful names of columns and CTEs (please don’t do WITH ‘cte’ AS…)
Try to make them ‘general’ so you can easily go back and tweak the logic
CTE Documentation for