
SCA Data Quality Framework

Maturation Curve

1) Awareness - Aware of DQ issues; collecting a list of issues manually through exploration, project work, or reporting errors raised by users. A good example is the Rabbit Hole Parking Lot.

2) Reactive (this is where we are) - Automated alerting, based on data critical or business critical data quality checks, that fires after data is already in production. There is a risk that stakeholders are impacted and the data team is always working behind the eight ball, but the automation helps the team catch and address these issues before stakeholders raise them.

3) Proactive - The data team is able to prevent data critical failures from hitting production, while also being able to alert stakeholders to anomalies, bugs, and other negative trends through business critical checks.

Types of Checks

Data Critical

These errors should be breaking errors for our data pipeline. Data impacted by these errors should be proactively prevented from getting to production as they pose a serious risk to the integrity of our data and reporting.

Completeness

Freshness

Grain integrity

Null handling
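The four data critical checks above can be sketched as breaking checks, meaning each one raises instead of warning. This is a minimal illustration over in-memory rows, not how Anomalo or our ETLs implement them; the function names, the `DataCriticalError` class, and every parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone


class DataCriticalError(Exception):
    """Raised when a breaking (data critical) check fails."""


def check_completeness(rows, expected_min_rows):
    # Completeness: the load should contain at least the expected number of rows.
    if len(rows) < expected_min_rows:
        raise DataCriticalError(f"completeness: {len(rows)} rows < {expected_min_rows}")


def check_freshness(rows, ts_column, max_age, now=None):
    # Freshness: the newest record must be no older than max_age.
    if not rows:
        raise DataCriticalError("freshness: no rows to inspect")
    now = now or datetime.now(timezone.utc)
    age = now - max(row[ts_column] for row in rows)
    if age > max_age:
        raise DataCriticalError(f"freshness: newest record is {age} old")


def check_grain(rows, key_columns):
    # Grain integrity: exactly one row per combination of key columns.
    seen = set()
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            raise DataCriticalError(f"grain: duplicate key {key}")
        seen.add(key)


def check_not_null(rows, required_columns):
    # Null handling: required columns must never be None or absent.
    for row in rows:
        for col in required_columns:
            if row.get(col) is None:
                raise DataCriticalError(f"null: {col} is missing in {row}")
```

Because every check raises, a pipeline step that calls them stops on the first failure, which matches the "breaking error" intent described above.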

Business Critical

These errors are what we think of when we say "exception reporting". They are normally related to oddities from upstream data sources, funky/outdated logic, or unconventional system design/business processes. We should be aware and warned of these issues with the end goal of being able to quickly diagnose and triage.

Expected values

Column comparisons

Dimension volume

Anomaly detection
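By contrast, the business critical checks above report findings for triage and do not stop the pipeline. The sketch below assumes in-memory rows and uses a simple z-score for anomaly detection; that is a stand-in technique chosen for illustration, not a description of how Anomalo detects anomalies. All function and column names are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def unexpected_values(rows, column, allowed):
    # Expected values: return any values that fall outside the allowed set.
    return sorted({row[column] for row in rows} - set(allowed))


def column_mismatches(rows, left, right):
    # Column comparisons: return rows where two columns that should agree differ.
    return [row for row in rows if row[left] != row[right]]


def dimension_volume(rows, column):
    # Dimension volume: row count per dimension value, for comparison across days.
    return Counter(row[column] for row in rows)


def is_anomalous(history, today, z_threshold=3.0):
    # Anomaly detection: flag today's value if it sits more than z_threshold
    # sample standard deviations away from the mean of the historical values.
    if len(history) < 2:
        return False  # not enough history to judge
    sigma = stdev(history)
    if sigma == 0:
        return today != history[0]
    return abs(today - mean(history)) / sigma > z_threshold
```

Each function returns its findings to the caller, which leaves the decision about alerting or ticketing to whoever triages the result.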

DQ Tools

Slack

#sca-etl-alerts notifies our team of Squarewave ETL failures and other data quality failures raised by Anomalo

#square-compliance-analytics-help allows stakeholders to reach our team in the event that they find DQ issues in our reporting or Snowflake tables

Anomalo

What is Anomalo?

Anomalo is a tool that data teams at Block leverage to run various data quality checks on their tables. Anomalo can check things such as daily freshness, expected values, table granularity, and other custom checks that can be created.

SCA Anomalo Configuration

The SCA team currently has all ETLs configured for fundamental daily checks under the cax_sca label.

These daily checks include:

Data Freshness

Data Volume

Missing Data

Table Anomalies

The next evolution for using Anomalo is to introduce validation rules. Validation rules can cover:

Checking every value of a column

Checking a relationship between multiple columns

Comparing multiple tables or SQL outputs

Checking column names and data types

Checking that a column defined in custom SQL logic is always true

Checking that a custom SQL query returns no bad data
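To make the "custom SQL query returns no bad data" idea concrete, here is a small sketch that runs such a rule against an in-memory SQLite table. It only illustrates the shape of a rule (select the bad rows, pass when nothing comes back); it does not use Anomalo's interface, and the `cases` table, its columns, and the `bad_rows` helper are hypothetical.

```python
import sqlite3


def bad_rows(conn, sql):
    # A rule passes when the query that selects "bad" rows returns nothing.
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (case_id INTEGER, opened_at TEXT, closed_at TEXT)")
conn.executemany(
    "INSERT INTO cases VALUES (?, ?, ?)",
    [
        (1, "2024-03-01", "2024-03-05"),
        (2, "2024-03-04", "2024-03-02"),  # closed before it was opened
        (3, "2024-03-06", None),          # still open, so not a violation
    ],
)

# Relationship between multiple columns: a case cannot close before it opens.
# ISO-8601 date strings compare correctly as plain text.
RULE_SQL = """
    SELECT case_id FROM cases
    WHERE closed_at IS NOT NULL AND closed_at < opened_at
"""

failures = bad_rows(conn, RULE_SQL)
print(failures)  # [(2,)]
```

Writing the rule as "return the offending rows" means a failure carries the exact records needed for triage.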

The final evolution for using Anomalo is to develop key metrics. Metrics can be defined using a variety of pre-defined aggregates, shown in the screenshot below (photo from this article).

[Image: Screenshot 2024-03-08 at 12.53.27 PM.png]

For more reading on Anomalo’s functionality and possibilities, see their docs.

JIRA

Any data quality bugs should be raised in JIRA as a bug ticket in the SCA team project. Create a JIRA ticket: link

Next Steps on Our Road to Proactive

Identify the critical, warning, and contextual alerts/warnings/metrics our team wants to enable in the Brainstorming doc

Prioritize the critical and warning level alerts to solidify the foundation of our framework

Build out and test the contextual business alerts and metrics we’d like to create to test the full capacity of Anomalo

Explore avenues to prevent bad production data that fails critical tests from reaching stakeholders and reports
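One possible avenue for the last step is to audit staged data before it is published, so that a critical failure blocks the publish and a warning only notifies. The sketch below shows that gating logic under stated assumptions: checks are plain functions that return True on success, and `publish` and `alert` are hypothetical callables standing in for whatever loads production tables and posts to Slack.

```python
class CriticalCheckFailed(Exception):
    """Raised when a data critical check fails and publishing must stop."""


def audit_and_publish(staged_rows, critical_checks, warning_checks, publish, alert):
    # Run every critical check against staged data before anything is published.
    failures = [
        name for name, check in critical_checks.items() if not check(staged_rows)
    ]
    if failures:
        alert(f"CRITICAL, publish blocked: {', '.join(failures)}")
        raise CriticalCheckFailed(failures)

    # Warning-level (business critical) checks never block; they only notify.
    for name, check in warning_checks.items():
        if not check(staged_rows):
            alert(f"WARNING: {name}")

    # Only data that passed every critical check reaches production.
    publish(staged_rows)
```

A call might look like `audit_and_publish(rows, {"not_empty": lambda r: len(r) > 0}, {}, load_to_prod, post_to_slack)`, where `load_to_prod` and `post_to_slack` are placeholders for the team's own functions.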
