Sourcing + Storing Data with BigQuery


You can source data from anywhere! Making use of BigQuery’s public datasets, querying APIs, and scraping web data can all be helpful. We’ll also cover how to push data into BigQuery, share it with collaborators, and export it into visualization tools.

Data Warehouses

Here’s a bit of a primer on how to work with data warehouses if you are not very familiar with them. A data warehouse is essentially a central repository of data that an organization can work out of. You can load tables into it, write SQL in the interface, and move data into other visualization tools. There are a number of data warehousing solutions on the market, like Redshift, Snowflake, and BigQuery, to name a few. For our project, we decided to work with Google BigQuery. BigQuery is pretty easy to set up, relatively low cost, and has a familiar user interface. We’d recommend BigQuery to folks doing similar projects, and Google’s quickstart documentation can help you get set up with BigQuery.

Sourcing Data

BigQuery Public Datasets

There are a bunch of useful datasets already loaded into BigQuery. You can just write SQL to query these tables and save the new table into your project’s dataset! For example, we wanted to make use of the Census Bureau’s American Community Survey (ACS) data so that we could better understand demographic trends in the Detroit area.


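A minimal sketch of that kind of query, written here with the google-cloud-bigquery Python client, might look like the following. The project, dataset, and zip codes are placeholders, and it’s worth double-checking the public ACS table and column names in the bigquery-public-data listing before running it.

```python
# Sketch: copy ACS data for a few Detroit-area zip codes from BigQuery's
# public datasets into a table in your own dataset. Project, dataset, and
# zip codes are placeholders; verify the public table/column names first.
from google.cloud import bigquery

client = bigquery.Client(project="your-project")

query = """
CREATE OR REPLACE TABLE `your-project.your_dataset.detroit_acs_by_zip` AS
SELECT *
FROM `bigquery-public-data.census_bureau_acs.zip_codes_2018_5yr`
WHERE geo_id IN ('48201', '48202', '48226')
"""

client.query(query).result()  # blocks until the CREATE TABLE job finishes
```
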
This is an example of the kind of base table we created: the resulting table has all of the ACS data for the zip codes that are specified. If you are interested in working with Census data in particular, Google offers a bunch of these datasets in the bigquery-public-data project. You can simply write a SQL SELECT statement to grab the data you are interested in, and then use CREATE TABLE to place the data wherever makes sense.

Querying APIs


If you are looking to source data from a specific organization or about a particular topic and there’s an API available, you can write some Python code to query the API and move the data into BigQuery.

You’ll want to first identify the API docs for the API you are hoping to pull data from. For example, if you were doing a project analyzing restaurant trends, you might want to pull data from the Yelp API. You’d start by finding the API docs (which are basically guides written by the authors of the API that help you navigate it) and then writing a few simple lines of code to extract the data you want.

If you’ve never written Python before, no worries! Using the requests library to query APIs is pretty simple, and a quick article on Python basics plus a primer on the requests library will give you all the background you need.


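A sketch of that request loop might look something like this. The API key, the list of cities, and the fields kept are all placeholders to illustrate the idea, not the exact code we ran.

```python
# Sketch: query the Yelp Fusion API for restaurants in a few cities and
# collect the results into a list of dictionaries.
import requests

API_KEY = "your-yelp-api-key"  # placeholder; use your own key
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
SEARCH_URL = "https://api.yelp.com/v3/businesses/search"

cities = ["Detroit, MI", "Ann Arbor, MI", "Grand Rapids, MI"]  # illustrative
rows = []

for city in cities:
    params = {"location": city, "categories": "restaurants", "limit": 50}
    response = requests.get(SEARCH_URL, headers=HEADERS, params=params)
    response.raise_for_status()
    # Keep only the fields we care about for the eventual table
    for business in response.json().get("businesses", []):
        rows.append({
            "city": city,
            "name": business.get("name"),
            "rating": business.get("rating"),
            "review_count": business.get("review_count"),
            "price": business.get("price"),
        })
```
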
The code block above makes use of the Yelp API to store restaurant data from a given list of cities in a list of dictionaries that can be transformed into a table. We can then turn this data object into a table using the Google BigQuery API.
You can create a table with the BigQuery API using the code block below as an example.
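
A minimal sketch of that load step with the google-cloud-bigquery client follows. The destination names are placeholders, and `rows` is the list of dictionaries built in the previous snippet.

```python
# Sketch: load the list of dictionaries built above into a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client(project="your-project")
table_id = "your-project.your_dataset.yelp_restaurants"  # placeholder destination

# Let BigQuery infer the schema from the dictionaries' keys and value types
job_config = bigquery.LoadJobConfig(autodetect=True)
load_job = client.load_table_from_json(rows, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```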

Scraping Web Data


If you want to pull data from a website and there isn’t an API available to query, you can use Python’s Beautiful Soup library. Beautiful Soup is a package dedicated to pulling data out of HTML files. So, for example, say you want to pull a table out of a Wikipedia page and turn that into a BigQuery table.

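A rough sketch of that scraping step looks like this. The page URL and the table-parsing details are illustrative; in practice you’d inspect the HTML of the page you care about to find the right tags and classes.

```python
# Sketch: scrape a table from a Wikipedia page into a list of dictionaries.
import requests
from bs4 import BeautifulSoup

# Illustrative page; swap in whichever page holds the table you want
url = "https://en.wikipedia.org/wiki/Demographics_of_Detroit"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

# Wikipedia data tables usually carry the "wikitable" class
table = soup.find("table", {"class": "wikitable"})
header_cells = table.find("tr").find_all(["th", "td"])
headers = [cell.get_text(strip=True) for cell in header_cells]

rows = []
for tr in table.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(dict(zip(headers, cells)))
```
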
The above code pulls in the Wikipedia page we want to extract data from, uses an understanding of HTML tags (any quick introduction to HTML will cover these) to parse out the wanted data, and saves the data into a Python object.

You can then use the code from the last section to transform the Python objects into a BigQuery table.

Data Cleaning


Sometimes you might run into useful raw data somewhere on the internet, and it might even be sitting in a nice CSV for you. However, the data might not be structured in a way that you can load into BigQuery or into a data visualization tool. Often when raw data is published on the internet, it is captured in a presentation format, maybe with counts and crosstabs for easy consumption. For example, we used the results of the Detroit Community Health Survey in our project. The original form was not suited for upload into Tableau because of the format it was in. By using Google Sheets formulas, we were able to standardize the counts (providing raw numbers instead of percentages) and make the spacing and organization more tenable, resulting in a final standardized version. There are plenty of guides on how to standardize messy data and how to implement Google Sheets formulas if you want to read more.

Storing Data


All the above methods show you how to port data into BigQuery. BigQuery is a cloud-based data warehouse that makes it easy to store, manipulate, and port data. It’s a pretty accessible service: it’s free to create an account and store data, and while running queries isn’t always free, it takes a lot of queries and computational power before you start racking up a bill. You can just create a project, create a dataset within the project, and then start creating tables to fill the dataset. You can write SQL on those tables and export your results into Google Data Studio, Tableau, or other visualization tools.
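
For example, here is a minimal sketch of creating a dataset with the Python client. The project and dataset names are placeholders, and the same thing is only a couple of clicks in the BigQuery console.

```python
# Sketch: create a dataset inside your project to hold your tables.
from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # placeholder project

dataset = bigquery.Dataset("your-project.detroit_health")  # placeholder dataset
dataset.location = "US"

# exists_ok makes this safe to re-run
dataset = client.create_dataset(dataset, exists_ok=True)
print(f"Dataset ready: {dataset.full_dataset_id}")
```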

It’s also easy to work with collaborators in BigQuery. Sharing projects and tables is as easy as adding people to a Google Doc, and the BigQuery documentation has a guide on how to do that.
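
If you’d rather script it, here’s a rough sketch of granting a collaborator read access to a dataset with the Python client. The email address and dataset name are placeholders.

```python
# Sketch: share a dataset by adding a collaborator as a reader.
from google.cloud import bigquery

client = bigquery.Client(project="your-project")
dataset = client.get_dataset("your-project.detroit_health")  # placeholder dataset

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="teammate@example.com",  # placeholder collaborator
    )
)
dataset.access_entries = entries

client.update_dataset(dataset, ["access_entries"])  # push the updated access list
```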
