Objective
This style guide is intended to standardize how the CAX team writes production code. Ad-hoc queries can be written in whatever format pleases the developer. Once the code needs to go in production, the style guide should be applied.
In order to make applying these styling principles easier for developers, there is a SQL linter called that will easily apply these changes to whatever file a dev is changing. SQL Fluff
SQLFLuff is a SQL linter that’s used to define the basic structure and style of the SQL that we write, and move the review and style of that structure into the hands of the authors.
In order to utilize SQL fluff you’ll need to install it via pip , pypi , or your preferred package installer.
You can confirm its installation by getting SQLFluff to show its version number.
How to use
SQL Fluff has two core commands that are executed on a file
1) The first is lint which scans the file, or file path, you’ve designated for any SQL styling that doesn’t conform to the configured rules and prints out in your terminal which rules are not being followed
2) The second is fix which applies the correct styling to your file
We have added our own sqlfluff config file to the app-compliance-etl Squarewave repository so in whatever branch you’re working in you can run the sqlfluff commands to apply them to your production files.
If you don’t see a .sqlfluff config file in your app-compliance-etl local repo, make sure to run a git pull to update your local repo with any new changes in the remote main repo.
General Principles
Optimize for readability and maintainability rather than fewer lines of code Write code with the WET (write everything twice) programming principle in mind It's okay to write the same logic in two different places. Once we've written it a third time that logic should be standardized in a single upstream table Query Structure and Syntax Styling
General SQL Styling and Best Practices
This removes the need for devs to hit shift or caps lock to swap back and forth between cases IDEs used at Block, and Github, highlight keywords and functions to make it easier for readers to delineate between keywords and non-keyword syntax Improves readability of the code as it's the natural way lists are written Snowflake will be support for trailing commas soon - so the common complaint around using trailing commas will soon be a thing of the past Put each column on its own line and each SQL keyword on it's own line If there is a distinct qualifier, put it on the same line as select Each new column line should be indented 4 spaces Left align select statements if they're not in a CTE
Avoid using select * outside of import CTEs - it isn't explicit as to what columns are being used and some tables could have hundreds of columns which makes debugging difficult CTEs
Utilize CTEs instead of subqueries CTEs help to break your code into more readable blocks; ideally performing one logical body of work per CTE CTEs should be placed at the beginning of the query Use import CTEs to filter datasets This can help with over ETL performance, especially when joining tables later in the script Be explicit with CTE names. The CTE name should indicate what that code block is trying to accomplish Indent the select statement within your CTE 1 tab, or 4 spaces Below is an example of a properly formatted and named CTE Use a final CTE before your final select statement. This comes in handy when debugging a script where you have multiple CTEs since you can easily isolate the CTE that has the bug by replacing the select * from final with select * from cte_name CASE Statements
Case statements with one when clause can be written on one line Case statements with multiple when clauses should be written with each when clause on it's own line If a when clause has multiple conditions the first condition should be written on the same line as the when clause, each subsequent condition should go on a new line and indented from the first line Join Conditions
Be explicit in JOIN conditions - use inner join instead of just join If using an inner join, apply your filtering logic in the where clause, not the join clause Avoid using right joins - instead, restructure your query to use left joins Put the table referenced first immediately after the on in the join condition Joins with single conditions should be written on one line Joins with multiple conditions should have each condition written on it's own line Aggregation, Filtering, and Ordering
- group by column names, not numbers; same for order by
- group by all is your best friend if you need to group by a lot of columns; just make sure to put your aggregates at the end of your column list so it's understood what columns are being aggregated
- Each column in the group by , order by , where should have it's own line unless there is only one column being utilized
Use lateral column aliasing in your `group by` Avoid using order by to reduce compute costs - leave that to the BI tools or Excel/GSheet user to do that Break long lists of in values into multiple lines with one value per line Table and Column Naming Conventions
Be explicit with column and table names The more descriptive a column/table name is the less time someone querying, reading or updating the table will have to spend figuring it out what it is It's okay if column and table names seem obnoxiously long - embrace being explicit When aliasing columns use as to define the alias name Be descriptive with alias names; using a short form version of the table name is a good place to start Avoid using non-descriptive naming conventions such as id , name , and type - prefix these columns with what is being identified or named Booleans should start with has_ or is_ Timestamps should end with _at Dates should end with _date Avoid reserved keywords like date or month as column names Sample SQL