This is the first essay in a series about databases, relationality and a new approach to managing data that I’m working on. This essay is built upon in the essays that follow.
In this section I’m going to define what we mean by “relational” databases, and tell a sort of simplified history of the “rise and fall” of so-called No-SQL databases that tried to move away from relationality. If you lived through this period this may not teach you anything you don’t know, so feel free to skip, but I want to establish some terms and context for what comes next.
This section is long, but the topic I’m discussing is far, far larger than I’m able to do justice to in this essay. I am calling this a short and mostly wrong history of databases because I’m making a lot of generalizations, so if you understand this stuff and see me get something wrong or see me miss some nuance that matters to you, please bear with me. I’m telling a story here, and defining some concepts that we’ll need to work with later.
Relationality and SQL
The core idea is that there is more than one way to build relationality into a database. Normalized tabular structures have been the go-to model for 50 years, because Codd’s relational model turned out to be a very useful way to think about capturing fully relational structures within linear byte arrays.
(“Relationality”, for the uninitiated, is a property of datasets above some basic level of complexity: everything is defined in terms of everything else. We take our complex real-world data, break it up into different tables, each of which captures a specific aspect of it, and then we “relate” those tables to each other such that the actual things we are looking at can be assembled by grabbing rows from different tables according to arbitrarily complex logic and then merging those rows into some final format. More below!)
I love SQL. I’m not saying SQL is bad, or suggesting that we deprecate RDBMS tooling, or anything like that. But I do think that there’s some utility in considering an alternative mechanism for capturing relationality in data.
I’ve written a bit more about it elsewhere, but the TLDR is something like this:
Let’s consider some reasonably complex domain of data - for our purposes, let’s talk about “the movies”, as represented in IMDB.
Without digging into the concrete specifics of IMDB’s approach, let’s put on our Relational Database Hat and think about what their underlying SQL structure might look like. We can assume different tables for various domain models, so let’s imagine a People table and a Movies table and a Characters table.
We’ll then also need tables to capture Casting, for instance, which would contain rows for things like Person A acted in Movie B as Character C, and so on and so forth until everything we care about is normalized out to the appropriate normal form.
This is how we generally think about relational data. Every time we want to capture some new aspect of our domain data we either add columns to existing tables or we add new tables and then join them to the others via whatever rules make sense. And so by doing this with sufficient rigor we can eventually get to a place where we can ask our database to play six degrees of Kevin Bacon, and that’s a very efficient, optimized query. With me so far? Historically, the technology used to do this was called SQL (Structured Query Language), and SQL is still how a lot of databases work today.
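To make that concrete, here’s a minimal sketch of what such a schema and a one-hop Kevin Bacon query might look like. The table and column names are my own invention (not IMDB’s actual schema), and I’m using SQLite purely because it’s easy to run:

```python
import sqlite3

# A minimal sketch of the hypothetical movie schema described above.
# Table and column names are illustrative, not IMDB's actual schema.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people     (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE movies     (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE characters (id INTEGER PRIMARY KEY, name TEXT);
    -- Casting is a join table: "Person A acted in Movie B as Character C".
    CREATE TABLE castings (
        person_id    INTEGER REFERENCES people(id),
        movie_id     INTEGER REFERENCES movies(id),
        character_id INTEGER REFERENCES characters(id)
    );
""")

db.executemany("INSERT INTO people VALUES (?, ?)", [(1, "Kevin Bacon"), (2, "Tom Hanks")])
db.executemany("INSERT INTO movies VALUES (?, ?)", [(1, "Apollo 13")])
db.executemany("INSERT INTO characters VALUES (?, ?)", [(1, "Jack Swigert"), (2, "Jim Lovell")])
db.executemany("INSERT INTO castings VALUES (?, ?, ?)", [(1, 1, 1), (2, 1, 2)])

# "Who acted alongside Kevin Bacon, and in what?" -- one hop of the Kevin
# Bacon game, expressed entirely as joins over the normalized tables.
rows = db.execute("""
    SELECT co.name, m.title
    FROM people p
    JOIN castings c1 ON c1.person_id = p.id
    JOIN castings c2 ON c2.movie_id = c1.movie_id AND c2.person_id != p.id
    JOIN people co   ON co.id = c2.person_id
    JOIN movies m    ON m.id = c1.movie_id
    WHERE p.name = 'Kevin Bacon'
""").fetchall()
print(rows)  # [('Tom Hanks', 'Apollo 13')]
```

The Casting join table is doing all the relational work here: the full Kevin Bacon game is just this same join repeated more times.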
The Rise of No-SQL
But about 15 years ago there was this emerging “disruption” in the database space. We started to get “No-SQL” databases, document stores like CouchDB, which said “oh my god, that relational stuff is so complicated, I just want to write a JSON object into a queryable index, why are you asking me to learn about normal forms?” They pointed out, rightly, that one of the big drawbacks of relational databases was that they required a degree of expertise to navigate complexity in service of a flexibility that wasn’t necessarily useful to all application domains.
A common argument was that “Well look, if we’re using a relational database, that means we need to understand and anticipate the shape of our data from the outset. What if the schema changes? That’s actually a big complex thing to have to navigate, migrations are expensive and sometimes high risk, and it adds a ton of friction to everything we do. We want to be fast! We want to be nimble! We want to support arbitrary schemas and not worry about how that changes our tables!”
And so we got a bunch of technologies that sidestepped the complexity of relational databases in favor of document stores. “Look, I have a ton of data and it’s very complex, let me just create a JSON schema and then write that to an engine that allows me to do queries on the data in the JSON. I don’t need relationality because I’m not going to break my data up across a dozen tables; I’m just going to store stand-alone, isolated documents where all related data lives together. And look, if I need to add a field to my data model I can just update the JSON in the database in a single pass, or just start using the new field, easy-peasy!”
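Concretely, the document-store version of our movie data might look something like this - one self-contained document per movie, with a shape that’s entirely up to you (this is an illustrative sketch, not any particular engine’s format):

```python
import json

# The same movie, modeled the document-store way: one stand-alone document
# where all related data lives together. The shape is hypothetical -- nothing
# in the engine enforces it.
movie = {
    "title": "Apollo 13",
    "cast": [
        {"person": "Kevin Bacon", "character": "Jack Swigert"},
        {"person": "Tom Hanks",   "character": "Jim Lovell"},
    ],
}

# "Adding a field" is just adding a field: no migration, no ALTER TABLE.
movie["release_year"] = 1995

# In a document store this dict would be written as-is to a queryable index;
# here we just serialize it to show the stored shape.
print(json.dumps(movie, indent=2))
```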
And around 2010 there was much rejoicing as software engineers started modeling their data as basically persistent object stores with dynamic schemas. Anyone in the startup scene in those days may remember the sort of liberatory and almost ideological affinity for things like CouchDB over the “old and complex” SQL engines of yore.
The Fall of No-SQL
And anyone in the startup scene five years later may remember how many companies were investing in huge initiatives to migrate their data from brittle No-SQL stores into normalized, scalable SQL engines. No-SQL has its place, to this day - sometimes a document store IS the right solution - but Silicon Valley seemed to learn, at great expense, that for your canonical source of truth you really want a relational database. Why?
Because you can’t know in advance which data is going to be related to which other data. As your business grows it turns out your data domain evolves, often in unpredictable ways. Maybe you’ve got a large document store of inventory items, and suddenly you realize you need to be able to answer questions about how items in that inventory relate to each other. In a SQL database that’s easy - you’ve already broken these things down across a normalized set of tables; all you have to do is construct a query that aggregates the things you care about.
But if you’ve got a document store, each item in your inventory is an isolated block of text. Sure, you can in theory reference other documents - but those references are brittle. You don’t have an efficient way, for instance, to propagate changes through chains of references. Links break. The data decays over time, and you have to perform large, costly bulk operations just to glean any kind of insight about your data as a whole.
So ironically all the reasons you opted not to use SQL turn out to be the very same reasons your No-SQL approach breaks down over time. Sure, a SQL database requires migrations to handle schema changes. But a No-SQL database can’t handle schema changes at all in the ways that matter over time. A SQL database requires you to identify tables and columns in advance - but a No-SQL database requires you to correctly guess what your domain model looks like in its entirety in advance, and then breaks your ability to easily examine it as a whole.
(There are still plenty of places where having document stores or KV stores is useful and correct. I’m not saying that No-SQL was a mistake in general - I’m just relaying my own experience, and talking about the way the industry at large over-corrected against relationality in a way that ended up having large costs.)
Big Data and Map/Reduce
While all of this was going on, you may remember the importance of something the industry referred to as “Big Data”. As the internet scaled up, tech companies started capturing truly massive amounts of data from their users. Every click, every ad served, every query submitted - and the countless analytical observations derived from these things - ended up creating datasets that were simply too big to work with using conventional practices. Even a normalized SQL database is still constrained by the memory it has to work with, and “Big Data” is what we called it when your dataset was just so large that you couldn’t hold it all in a single place. What do you do when the index for your database exceeds your available RAM? At some point it becomes cost-prohibitive and then technologically infeasible to simply keep “scaling up”, so we needed some new approaches.
In 2010 or so, if you had this problem, you were likely using a family of software built around Map/Reduce, which made it possible to orchestrate meaningful operations over “distributed databases”.
The problem was this: if your data is so big that it has to live on 100 different computers, how do you meaningfully query it? Each instance in your cluster only knows about its own data. Sure, you can query each instance and get back what it thinks is relevant, but how do you query against the entire dataset? And how do you perform aggregate operations, like finding the value of X that’s greater than the average of Y but less than the sum of Z, and so on? These operations are simply intractable if your data is large enough.
Map/Reduce solved this by breaking the work down into a specific ordered set of operations. Think of this as a sort of specific application of Array#map and Array#reduce, where first you are mapping across all of the nodes in your cluster and extracting data salient in some way to the query. Now you’ve turned 100 distributed nodes into a single (still potentially very large) array of data. Then you perform a reduce operation against that data, such that your accumulator is persisted and passed from dataset to dataset in a deterministic way.
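Here’s a toy sketch of that shape in plain Python, with a few in-memory lists standing in for the nodes of a cluster. The movies are real Kevin Bacon films, but the ratings are made-up numbers; real frameworks add partitioning, shuffling and fault tolerance on top of this basic pattern.

```python
from functools import reduce

# A toy version of the map/reduce shape: three in-memory lists stand in for
# the nodes of a cluster. The ratings are made-up numbers for illustration.
node_1 = [{"movie": "Apollo 13", "rating": 7.7}, {"movie": "Footloose", "rating": 6.6}]
node_2 = [{"movie": "Mystic River", "rating": 7.9}]
node_3 = [{"movie": "Tremors", "rating": 7.1}]
cluster = [node_1, node_2, node_3]

# Map phase: each node independently extracts the data salient to the query.
def map_step(rows):
    return [row["rating"] for row in rows]

mapped = [rating for node in cluster for rating in map_step(node)]

# Reduce phase: fold the mapped values into an accumulator, one value at a
# time, in a deterministic order.
def reduce_step(acc, rating):
    return {"count": acc["count"] + 1, "total": acc["total"] + rating}

acc = reduce(reduce_step, mapped, {"count": 0, "total": 0})
print(acc["total"] / acc["count"])  # average rating across the whole "cluster"
```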
A Map/Reduce operation was long-running and prone to all sorts of complexity. It was important that you write your steps in specific ways, that you indexed on specific things, that you treated everything as immutable, etc. If a node in your cluster went down while the operation was running, you needed things structured in a certain way to allow recovery. None of this is the sort of thing you can just execute when the user loads a webpage - you needed a data engineering team.
The advantage here was that you could run Map/Reduce flows against arbitrarily large and even partially-structured data, as long as your map and reduce code encoded the relations you cared about consistently. This was the era of big data consultants raking in half a million dollars a year to tune indexes, optimize queries and make tiny config changes that shaved days off of expensive operations.
Sparks and Lakes
Map/Reduce was eventually replaced, in many ways, by things like Spark and other technologies that found other clever ways to do large queries at scale. The reason I include this story in this history is not that Map/Reduce itself was particularly interesting, but because it gave rise to a new way to organize data that was sort of orthogonal to the SQL/No-SQL debates of the time: the data lake.
As our ability to work with distributed databases improved, we generally realized that we didn’t actually need to store all of our data in the same way. If we are writing bespoke code to handle the partitions in a distributed database, why not just throw all of our data into one single bucket regardless of how it’s structured?
And so a data lake is just a place where all of your SQL dumps, all of your logs, all of your documents, everything you have, just gets aggregated into one place. And then you have tools that abstract over that and allow you to interface with it as if it was all one normalized database anyway.
Like a relational database, your domain data is distributed over a number of loosely coupled structures. But like a No-SQL database, the relationships between those structures aren’t enforced by the database engine. And like a Map/Reduce operation, your queries aren’t going to be fast or reliable enough for this to become your database of record backing user interactions. A data lake is for analytical operations, where you ask questions after the fact to learn from your records.
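As a toy illustration of the idea: imagine a directory full of heterogeneous dumps - CSV exports from the relational store sitting next to JSON-lines logs - with a thin query layer that normalizes them at read time. The file layout and field names here are hypothetical; in practice, tools like the ones above do this at much larger scale.

```python
import csv, glob, json

# A toy sketch of the data-lake idea: heterogeneous dumps sit side by side in
# one bucket, and a thin query layer normalizes them at read time. The file
# layout and field names here are hypothetical.
def read_lake(path):
    for name in glob.glob(f"{path}/*"):
        if name.endswith(".csv"):       # e.g. exports from the relational store
            with open(name) as f:
                yield from csv.DictReader(f)
        elif name.endswith(".jsonl"):   # e.g. application logs, one JSON object per line
            with open(name) as f:
                yield from (json.loads(line) for line in f)

# "How many records mention Apollo 13?" -- asked across every file at once,
# without caring which system originally produced each record.
count = sum(1 for row in read_lake("lake") if row.get("movie") == "Apollo 13")
print(count)
```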
Today, most companies use relational databases for their operational datastores (the things that drive user interactions), use data from those relational databases to cache specific views in ephemeral No-SQL stores for specific purposes, and use data lakes to aggregate and analyze the overall set of data they accumulate as they run.